Abstract

In various applications involving hidden Markov models (HMMs), some of
the hidden states are aliased, having identical output distributions. The minimality,
identifiability and learnability of such aliased HMMs have been long-standing
problems, with only partial solutions provided thus far. In this paper we focus
on parametric-output HMMs, whose output distributions come from a parametric
family, and that have exactly two aliased states. For this class, we present a complete
characterization of their minimality and identifiability. Furthermore, for a
large family of parametric output distributions, we derive computationally efficient
and statistically consistent algorithms to detect the presence of aliasing and learn
the aliased HMM transition and emission parameters. We illustrate our theoretical
analysis by several simulations.


1 Introduction 

HMMs are a fundamental tool in the analysis of time series. A discrete time HMM
with $n$ hidden states is characterized by an $n \times n$ transition matrix, and by the
emission probabilities from these $n$ states. In several applications, the HMMs, or more
general processes such as partially observable Markov decision processes, are aliased,
with some states having identical output distributions. In modeling of ion channel
gating, for example, one postulates that at any given time an ion channel can be in only
one of a finite number of hidden states, some of which are open and conducting
current while others are closed, see e.g. Fredkin & Rice [1992]. Given electric current
measurements, one fits an aliased HMM and infers important biological insight regarding
the gating process. Other examples appear in the fields of reinforcement learning
[Chrisman, 1992, McCallum, 1995, Brafman & Shani, 2004, Shani et al., 2005] and
robot navigation [Jefferies & Yeap, 2008, Zatuchna & Bagnall, 2009]. In the latter
case, aliasing occurs whenever different spatial locations appear (statistically) identical
to the robot, given its limited sensing devices. As a last example, HMMs with several
silent states that do not emit any output [Leggetter & Woodland, 1994, Stanke &
Waack, 2003, Brejova et al., 2007] can also be viewed as aliased.

Key notions related to the study of HMMs, be they aliased or not, are their minimality,
identifiability and learnability:

Minimality. Is there an HMM with fewer states that induces the same distribution 
over all output sequences? 

Identifiability. Does the distribution over all output sequences uniquely determine
the HMM's parameters, up to a permutation of its hidden states?

Learning. Given a long output sequence from a minimal and identifiable HMM, 
efficiently learn its parameters. 

For non-aliased HMMs, these notions have been intensively studied and by now
are relatively well understood, see for example Petrie [1969], Finesso [1990], Leroux
[1992], Allman et al. [2009] and Cappé et al. [2005]. The most common approach
to learning the parameters of an HMM is the Baum-Welch iterative algorithm [Baum
et al., 1970]. Recently, tensor decompositions and other computationally efficient spectral
methods have been developed to learn non-aliased HMMs [Hsu et al., 2009, Siddiqi
et al., 2010, Anandkumar et al., 2012, Kontorovich et al., 2013].

In contrast, the minimality, identifiability and learnability of aliased HMMs have
been long-standing problems, with only partial solutions provided thus far. For example,
Blackwell & Koopmans [1957] characterized the identifiability of a specific
aliased HMM with 4 states. The identifiability of deterministic output HMMs, where
each hidden state outputs a deterministic symbol, was partially resolved by Ito et al.
[1992]. To the best of our knowledge, precise characterizations of the minimality,
identifiability and learnability of probabilistic output HMMs with aliased states are
still open problems. In particular, the recently developed tensor and spectral methods
mentioned above explicitly require the HMM to be non-aliased, and are not directly
applicable to learning aliased HMMs.

Main results. In this paper we study the minimality, identifiability and learnability of
parametric-output HMMs that have exactly two aliased states. This is the simplest possible
class of aliased HMMs, and as shown below, even its analysis is far from trivial.
Our main contributions are as follows: First, we provide a complete characterization
of their minimality and identifiability, deriving necessary and sufficient conditions for
each of these notions to hold. Our identifiability conditions are easy to check for any
given 2-aliased HMM, and extend those of Ito et al. [1992] for the case of deterministic
outputs. Second, we solve the problem of learning a possibly aliased HMM from a
long sequence of its outputs. To this end, we first derive an algorithm to detect whether
an observed output sequence corresponds to a non-aliased HMM or to an aliased one.
In the former case, the HMM can be learned by various methods, such as Anandkumar
et al. [2012], Kontorovich et al. [2013]. In the latter case we show how the aliased states
can be identified and present a method to recover the HMM parameters. Our approach
is applicable to any family of output distributions whose mixtures are efficiently learnable.
Examples include high dimensional Gaussians and product distributions, see
Feldman et al. [2008], Belkin & Sinha [2010], Anandkumar et al. [2012] and references
therein. After learning the output mixture parameters, our moment-based algorithm requires
only a single pass over the data. It is possibly the first statistically consistent and
computationally efficient scheme to handle 2-aliased HMMs. While our approach may
be extended to more complicated aliasing, such cases are beyond the scope of this paper.
We conclude with some simulations illustrating the performance of our suggested
algorithms.

2 Definitions & Problem Setup 

Notation. We denote by $I_n$ the $n \times n$ identity matrix and $\mathbf{1}_n = (1, \ldots, 1)^\top \in \mathbb{R}^n$. For
$v \in \mathbb{R}^n$, $\mathrm{diag}(v)$ is the $n \times n$ diagonal matrix with entries $v_i$ on its diagonal. The $i$-th
row and column of a matrix $A \in \mathbb{R}^{n \times n}$ are denoted by $A_{[i,\cdot]}$ and $A_{[\cdot,i]}$ respectively.
We also denote $[n] = \{1, 2, \ldots, n\}$. For a discrete random variable $X$ we abbreviate
$P(x)$ for $\Pr(X = x)$. For a second random variable $Z$, the quantity $P(z \mid x)$ denotes
either $\Pr(Z = z \mid X = x)$, or the conditional density $p(Z = z \mid X = x)$, depending on
whether $Z$ is discrete or continuous.

Hidden Markov Models. Consider a discrete-time HMM with $n$ hidden states $\{1, \ldots, n\}$,
whose output alphabet $\mathcal{Y}$ is either discrete or continuous. Let
$\mathcal{F}_\Theta = \{f_\theta : \mathcal{Y} \to \mathbb{R} \mid \theta \in \Theta\}$
be a family of parametric probability density functions where $\Theta$ is a suitable parameter
space. A parametric-output HMM is defined by a tuple $H = (A, \theta, \pi^0)$ where
$A$ is the $n \times n$ transition matrix of the hidden states

$$A_{ij} = \Pr(X_{t+1} = i \mid X_t = j) = P(i \mid j),$$

$\pi^0 \in \mathbb{R}^n$ is the distribution of the initial state, and the vector of parameters
$\theta = (\theta_1, \theta_2, \ldots, \theta_n) \in \Theta^n$ determines the $n$ probability density functions
$(f_{\theta_1}, f_{\theta_2}, \ldots, f_{\theta_n})$.

The output sequence of the HMM is generated as follows. First, an unobserved
Markov sequence of hidden states $x = (x_t)_{t=0}^{T-1}$ is generated according to the distribution

$$P(x) = \pi^0_{x_0} \prod_{t=1}^{T-1} P(x_t \mid x_{t-1}).$$

The output $y_t \in \mathcal{Y}$ at time $t$ depends only on the hidden state $x_t$ via
$P(y_t \mid x_t) = f_{\theta_{x_t}}(y_t)$. Hence the conditional distribution of an output sequence
$y = (y_t)_{t=0}^{T-1}$ is

$$P(y \mid x) = \prod_{t=0}^{T-1} P(y_t \mid x_t) = \prod_{t=0}^{T-1} f_{\theta_{x_t}}(y_t).$$

We denote by $P_{H,k} : \mathcal{Y}^k \to \mathbb{R}$ the joint distribution of the first $k$ consecutive outputs
of the HMM $H$. For $y = (y_0, \ldots, y_{k-1}) \in \mathcal{Y}^k$ this distribution is given by

$$P_{H,k}(y) = \sum_{x \in [n]^k} P(y \mid x) P(x).$$

Further we denote by $\mathcal{P}_H = \{P_{H,k} \mid k \geq 1\}$ the set of all these distributions.
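The generative process just described can be sketched in a few lines. This is only a minimal illustration, assuming univariate Gaussian outputs; the function name `sample_hmm`, the seed, the uniform initial distribution and the numerical values are choices made for the sketch and are not part of the model definition. Since $A_{ij} = P(i \mid j)$, the next state is drawn from a column of $A$.

```python
import numpy as np

def sample_hmm(A, means, stds, pi0, T, rng):
    """Draw hidden states x_0..x_{T-1} and Gaussian outputs y_0..y_{T-1}.

    A is column-stochastic: A[i, j] = P(X_{t+1} = i | X_t = j).
    """
    n = A.shape[0]
    x = np.empty(T, dtype=int)
    x[0] = rng.choice(n, p=pi0)                 # x_0 ~ pi^0
    for t in range(1, T):
        x[t] = rng.choice(n, p=A[:, x[t - 1]])  # column x_{t-1} of A
    y = rng.normal(means[x], stds[x])           # y_t ~ f_{theta_{x_t}}
    return x, y

# Illustrative 4-state chain (values assumed for this sketch). The last two
# states (indices 2 and 3) share the same output density.
A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
means = np.array([3.0, 6.0, 0.0, 0.0])
stds = np.ones(4)
pi0 = np.full(4, 0.25)

rng = np.random.default_rng(0)
x, y = sample_hmm(A, means, stds, pi0, T=2000, rng=rng)
```

Only `y` would be available to a learner; `x` is returned here purely for inspection.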
2-Aliased HMMs. For an HMM $H$ with output parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_n) \in \Theta^n$
we say that states $i$ and $j$ are aliased if $\theta_i = \theta_j$. In this paper we consider the
special case where $H$ has exactly two aliased states, denoted as 2A-HMM. Without
loss of generality, we assume the aliased states are the two last ones, $n-1$ and $n$. Thus,

$$\theta_1 \neq \theta_2 \neq \cdots \neq \theta_{n-2} \neq \theta_{n-1} \quad\text{and}\quad \theta_{n-1} = \theta_n.$$

We denote the vector of the $n-1$ unique output parameters of $H$ by
$\bar\theta = (\theta_1, \theta_2, \ldots, \theta_{n-2}, \theta_{n-1}) \in \Theta^{n-1}$.
For future use, we define the aliased kernel $\bar K \in \mathbb{R}^{(n-1) \times (n-1)}$ as the matrix of
inner products between the $n-1$ different $f_{\theta_i}$'s,

$$\bar K_{ij} = \langle f_{\theta_i}, f_{\theta_j} \rangle = \int_{\mathcal{Y}} f_{\theta_i}(y) f_{\theta_j}(y)\, dy, \qquad i, j \in [n-1]. \tag{1}$$
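For univariate Gaussian outputs the kernel (1) has a closed form, by the standard identity $\int \mathcal{N}(y; \mu_i, s_i^2)\, \mathcal{N}(y; \mu_j, s_j^2)\, dy = \mathcal{N}(\mu_i - \mu_j; 0, s_i^2 + s_j^2)$. The sketch below assumes this Gaussian family; the name `gaussian_kernel` and the parameter values are illustrative. One entry is cross-checked against a plain Riemann sum.

```python
import numpy as np

def gaussian_kernel(means, stds):
    """K[i, j] = integral of N(y; m_i, s_i^2) * N(y; m_j, s_j^2) dy."""
    m = np.asarray(means, dtype=float)
    v = np.asarray(stds, dtype=float) ** 2
    var_sum = v[:, None] + v[None, :]
    diff = m[:, None] - m[None, :]
    return np.exp(-diff ** 2 / (2.0 * var_sum)) / np.sqrt(2.0 * np.pi * var_sum)

# n - 1 = 3 unique output components (illustrative values).
means_bar = [3.0, 6.0, 0.0]
stds_bar = [1.0, 1.0, 1.0]
K_bar = gaussian_kernel(means_bar, stds_bar)

def pdf(y, m, s):
    """Univariate Gaussian density."""
    return np.exp(-(y - m) ** 2 / (2.0 * s ** 2)) / (s * np.sqrt(2.0 * np.pi))

# Numerical cross-check of K_bar[0, 1] on a fine grid.
grid = np.linspace(-15.0, 20.0, 200001)
dy = grid[1] - grid[0]
K01_numeric = np.sum(pdf(grid, 3.0, 1.0) * pdf(grid, 6.0, 1.0)) * dy
```

For other parametric families the integral in (1) would have to be evaluated analytically or numerically in the same spirit.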

Assumptions. As in previous works [Leroux, 1992, Kontorovich et al., 2013], we
make the following standard assumptions:

(A1) The parametric family $\mathcal{F}_\Theta$ of the output distributions is linearly independent of
order $n$: for any distinct $\{\theta_i\}_{i=1}^n$, $\sum_{i=1}^n a_i f_{\theta_i} \equiv 0$ iff $a_i = 0$ for all $i \in [n]$.

(A2) The transition matrix $A$ is ergodic and its unique stationary distribution
$\pi = (\pi_1, \pi_2, \ldots, \pi_n)$ is positive.


Note that assumption (A1) implies that the parametric family $\mathcal{F}_\Theta$ is identifiable, namely
$f_\theta = f_{\theta'}$ iff $\theta = \theta'$. It also implies that the kernel matrix $\bar K$ of (1) has full rank $n-1$.

3 Decomposing the transition matrix A 

The main tool in our analysis is a novel decomposition of the 2A-HMM’s transition 
matrix into its non-aliased and aliased parts. As shown in Lemma 1 below, the aliased 
part consists of three rank-one matrices, that correspond to the dynamics of exit from, 
entrance to, and within the two aliased states. This decomposition is used to derive 
the conditions for minimality and identifiability (Section 4), and plays a key role in 
learning the HMM (Section 5). 

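To this end, we introduce a pseudo-state $\bar n$, combining the two aliased states $n-1$
and $n$. We define

$$\pi_{\bar n} = \pi_{n-1} + \pi_n \quad\text{and}\quad \beta = \pi_{n-1} / \pi_{\bar n}. \tag{2}$$

We shall make extensive use of the following two matrices:

$$B = \begin{pmatrix} I_{n-2} & 0 & 0 \\ 0^\top & 1 & 1 \end{pmatrix} \in \mathbb{R}^{(n-1) \times n}, \qquad C_\beta = \begin{pmatrix} I_{n-2} & 0 \\ 0^\top & \beta \\ 0^\top & 1-\beta \end{pmatrix} \in \mathbb{R}^{n \times (n-1)}.$$

As explained below, these matrices can be viewed as projection and lifting operators,
mapping between non-aliased and aliased quantities.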


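Non-aliased part. The non-aliased part of $A$ is a stochastic matrix $\bar A \in \mathbb{R}^{(n-1) \times (n-1)}$
obtained by merging the two aliased states $n-1$ and $n$ into the pseudo-state $\bar n$. Its entries
are given by

$$\bar A = \begin{pmatrix} A_{[1:n-2] \times [1:n-2]} & \begin{matrix} P(1 \mid \bar n) \\ \vdots \\ P(n-2 \mid \bar n) \end{matrix} \\ \begin{matrix} P(\bar n \mid 1) & \cdots & P(\bar n \mid n-2) \end{matrix} & P(\bar n \mid \bar n) \end{pmatrix}, \tag{3}$$

where the transition probabilities into the pseudo-state are

$$P(\bar n \mid j) = P(n-1 \mid j) + P(n \mid j), \qquad \forall j \in [n],$$

the transition probabilities out of the pseudo-state are defined with respect to the stationary
distribution by

$$P(i \mid \bar n) = \beta P(i \mid n-1) + (1-\beta) P(i \mid n), \qquad \forall i \in [n],$$

and lastly, the probability to stay in the pseudo-state is

$$P(\bar n \mid \bar n) = \beta P(\bar n \mid n-1) + (1-\beta) P(\bar n \mid n).$$

It is easy to check that the unique stationary distribution of $\bar A$ is
$\bar\pi = (\pi_1, \pi_2, \ldots, \pi_{n-2}, \pi_{\bar n}) \in \mathbb{R}^{n-1}$.
Note that $\bar A = B A C_\beta$, $\bar\pi = B\pi$ and $\pi = C_\beta \bar\pi$, justifying the lifting and
projection interpretation of the matrices $B$, $C_\beta$.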
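The merging operation can be checked numerically. The sketch below assumes the forms of $B$ and $C_\beta$ implied by the identities $\bar\pi = B\pi$ and $\pi = C_\beta \bar\pi$ (sum the last two coordinates, respectively split the last coordinate in proportions $\beta$ and $1-\beta$); the helper names `stationary` and `projection_lifting` and the 4-state matrix are illustrative.

```python
import numpy as np

def stationary(A):
    """Stationary distribution of a column-stochastic matrix (A @ pi = pi)."""
    w, V = np.linalg.eig(A)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

def projection_lifting(n, beta):
    """B merges the last two states; C_beta splits the pseudo-state back."""
    B = np.zeros((n - 1, n))
    B[:n - 2, :n - 2] = np.eye(n - 2)
    B[n - 2, n - 2:] = 1.0
    C = np.zeros((n, n - 1))
    C[:n - 2, :n - 2] = np.eye(n - 2)
    C[n - 2, n - 2] = beta
    C[n - 1, n - 2] = 1.0 - beta
    return B, C

# Illustrative 4-state chain with the last two states aliased.
A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]
pi = stationary(A)
beta = pi[n - 2] / (pi[n - 2] + pi[n - 1])   # eq. (2)
B, C = projection_lifting(n, beta)

A_bar = B @ A @ C       # merged (n-1) x (n-1) transition matrix
pi_bar = B @ pi         # projected stationary distribution
```

Here `A_bar` is again column-stochastic and `pi_bar` is its stationary distribution, as stated in the text.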


Aliased part. Next we present some key quantities that distinguish between the two
aliased states. Let $\mathrm{supp}_{in} = \{j \in [n] \mid P(\bar n \mid j) > 0\}$ be the set of states that can move
into either one of the aliased states. We define

$$\alpha_j = \begin{cases} \dfrac{P(n-1 \mid j)}{P(\bar n \mid j)} & j \in \mathrm{supp}_{in} \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

as the relative probability of moving from state $j$ to state $n-1$, conditional on moving to
either $n-1$ or $n$. We define the two vectors $\delta^{out}, \delta^{in} \in \mathbb{R}^{n-1}$ as follows: $\forall i, j \in [n-1]$,

$$\delta^{out}_i = \begin{cases} P(i \mid n-1) - P(i \mid n) & i < n-1 \\ P(\bar n \mid n-1) - P(\bar n \mid n) & i = n-1 \end{cases} \tag{5}$$

$$\delta^{in}_j = \begin{cases} (\alpha_j - \beta) P(\bar n \mid j) & j < n-1 \\ \beta(\alpha_{n-1} - \beta) P(\bar n \mid n-1) + (1-\beta)(\alpha_n - \beta) P(\bar n \mid n) & j = n-1 \end{cases} \tag{6}$$

In other words, $\delta^{out}$ captures the differences in the transition probabilities out of the
aliased states. In particular, if $\delta^{out} = 0$ then starting from either one of the two
aliased states, the Markov chain evolution is identical. Intuitively such an HMM is
not minimal, as its two aliased states can be lumped together, see Theorem 1 below.

Similarly, $\delta^{in}$ compares the relative probabilities into the aliased states $\alpha_j$, to the
stationary relative probability $\beta = \pi_{n-1}/\pi_{\bar n}$. This quantity also plays a role in the
minimality of the HMM.

Lastly, for our decomposition, we define the scalar

$$\kappa = (\alpha_{n-1} - \beta) P(\bar n \mid n-1) - (\alpha_n - \beta) P(\bar n \mid n). \tag{7}$$

Decomposing A. The following lemma provides a decomposition of the transition
matrix in terms of $\bar A$, $\delta^{out}$, $\delta^{in}$, $\kappa$ and $\beta$ (all omitted proofs are given in the Appendix).

Lemma 1. The transition matrix $A$ of a 2A-HMM can be written as

$$A = C_\beta \bar A B + C_\beta \delta^{out} c_\beta^\top + b\, (\delta^{in})^\top B + \kappa\, b\, c_\beta^\top, \tag{8}$$

where $c_\beta^\top = (0, \ldots, 0, 1-\beta, -\beta) \in \mathbb{R}^n$ and $b = (0, \ldots, 0, 1, -1)^\top \in \mathbb{R}^n$.

In (8), the first term is the merged transition matrix $\bar A \in \mathbb{R}^{(n-1) \times (n-1)}$ lifted back into
$\mathbb{R}^{n \times n}$. This term captures all of the non-aliased transitions. The second matrix is zero
except in the last two columns, accounting for the exit transition probabilities from
the two aliased states. Similarly, the third matrix is zero except in the last two rows,
differentiating the entry probabilities. The fourth term is non-zero only on the lower
right $2 \times 2$ block involving the aliased states $n-1$, $n$. This term corresponds to the
internal dynamics between them. Note that each of the last three terms is at most a
rank-1 matrix, which together can be seen as a perturbation due to the presence of
aliasing.

In Section 5 we shall see that given a long output sequence from the HMM, the
presence of aliasing can be detected and the quantities $\bar A$, $\delta^{out}$, $\delta^{in}$, $\kappa$ and $\beta$ can all be
estimated from it. An estimate for $A$ is then obtained via Eq. (8).
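Lemma 1 can be verified numerically on a small example. The following sketch evaluates definitions (2) and (4)-(7) entry by entry and reassembles $A$ from the four terms of (8). The 4-state matrix, the helper `stationary` and the variable names are illustrative assumptions; $B$ and $C_\beta$ are taken in the form implied by $\bar\pi = B\pi$ and $\pi = C_\beta\bar\pi$.

```python
import numpy as np

def stationary(A):
    """Stationary distribution of a column-stochastic matrix (A @ pi = pi)."""
    w, V = np.linalg.eig(A)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]
pi = stationary(A)
beta = pi[n - 2] / (pi[n - 2] + pi[n - 1])          # eq. (2)

# Projection B and lifting C_beta.
B = np.zeros((n - 1, n))
B[:n - 2, :n - 2] = np.eye(n - 2)
B[n - 2, n - 2:] = 1.0
C = np.zeros((n, n - 1))
C[:n - 2, :n - 2] = np.eye(n - 2)
C[n - 2, n - 2], C[n - 1, n - 2] = beta, 1.0 - beta
A_bar = B @ A @ C

P_in = A[n - 2] + A[n - 1]                          # P(n_bar | j), j in [n]
alpha = np.divide(A[n - 2], P_in, out=np.zeros(n), where=P_in > 0)  # eq. (4)

d_out = np.append(A[:n - 2, n - 2] - A[:n - 2, n - 1],
                  P_in[n - 2] - P_in[n - 1])         # eq. (5)
d_in = np.append((alpha[:n - 2] - beta) * P_in[:n - 2],
                 beta * (alpha[n - 2] - beta) * P_in[n - 2]
                 + (1 - beta) * (alpha[n - 1] - beta) * P_in[n - 1])  # eq. (6)
kappa = ((alpha[n - 2] - beta) * P_in[n - 2]
         - (alpha[n - 1] - beta) * P_in[n - 1])      # eq. (7)

c_beta = np.zeros(n)
c_beta[n - 2:] = [1.0 - beta, -beta]
b = np.zeros(n)
b[n - 2:] = [1.0, -1.0]

# Four terms of decomposition (8).
A_rec = (C @ A_bar @ B
         + np.outer(C @ d_out, c_beta)
         + np.outer(b, d_in @ B)
         + kappa * np.outer(b, c_beta))
```

On this example `A_rec` agrees with `A` up to floating-point error.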

4 Minimality and Identifiability 

Two HMMs $H$ and $H'$ are said to be equivalent if their observed output sequences are
statistically indistinguishable, namely $\mathcal{P}_{H'} = \mathcal{P}_H$. Similarly, an HMM $H$ is minimal if
there is no equivalent HMM with fewer states. Note that if $H$ is non-aliased
then Assumptions (A1-A2) readily imply that it is also minimal [Leroux, 1992]. In this
section we present necessary and sufficient conditions for a 2A-HMM to be minimal,
and for two minimal 2A-HMMs to be equivalent. Finally, we derive necessary and
sufficient conditions for a minimal 2A-HMM to be identifiable.

4.1 Minimality 

The minimality of an HMM is closely related to the notion of lumpability: can hidden
states be merged without affecting the distribution $\mathcal{P}_H$ [Fredkin & Rice, 1986, White
et al., 2000, Huang et al., 2014]? Obviously, an HMM is minimal iff no subset of
hidden states can be merged. In the following theorem we give precise conditions for
the minimality of a 2A-HMM.

Theorem 1. Let $H$ be a 2A-HMM satisfying Assumptions (A1-A2) whose initial state
$X_0$ is distributed according to $\pi^0 = (\pi^0_1, \ldots, \pi^0_{n-2}, \beta^0 \pi^0_{\bar n}, (1-\beta^0)\pi^0_{\bar n})$. Then,

(i) If $\pi^0_{\bar n} \neq 0$ and $\beta^0 \neq \beta$ then $H$ is minimal iff $\delta^{out} \neq 0$.

(ii) If $\pi^0_{\bar n} = 0$ or $\beta^0 = \beta$ then $H$ is minimal iff both $\delta^{out} \neq 0$ and $\delta^{in} \neq 0$.

By Theorem 1, a necessary condition for minimality of a 2A-HMM is that the two
aliased states have different exit probabilities, $\delta^{out} \neq 0$. Namely, there exists a
non-aliased state $i \in [n-2]$ such that $P(i \mid n-1) \neq P(i \mid n)$. Otherwise the two aliased
states can be merged. If the 2A-HMM is started from its stationary distribution, then
an additional necessary condition is $\delta^{in} \neq 0$. This last condition implies that there is a
non-aliased state $j \in \mathrm{supp}_{in} \setminus \{n-1, n\}$ with relative entrance probability $\alpha_j \neq \beta$.

4.2 Identifiability 

Recall that an HMM $H$ is (strictly) identifiable if $\mathcal{P}_H$ uniquely determines the transition
matrix $A$ and the output parameters $\theta$, up to a permutation of the hidden states. We
establish the conditions for identifiability of a 2A-HMM in two steps. First we derive a 
novel geometric characterization of the set of all minimal HMMs that are equivalent to 
H, up to a permutation of the hidden states (Theorem 2). Then we give necessary and 
sufficient conditions for H to be identifiable, namely for this set to be the singleton set, 
consisting of only H itself (Appendix C). In the process, we provide a simple procedure 
(Algorithm 1) to determine whether a given minimal 2A-HMM is identifiable or not. 

Equivalence between minimal 2A-HMMs. Necessary and sufficient conditions for
the equivalence of two minimal HMMs were studied in several works [Finesso, 1990,
Ito et al., 1992, Vanluyten et al., 2008]. We now provide analogous conditions for
parametric output 2A-HMMs. Toward this end, we define the following 2-dimensional
family of matrices $S(\tau_{n-1}, \tau_n) \in \mathbb{R}^{n \times n}$ given by

$$S(\tau_{n-1}, \tau_n) = \begin{pmatrix} I_{n-2} & 0 & 0 \\ 0 \;\cdots\; 0 & \tau_{n-1} & \tau_n \\ 0 \;\cdots\; 0 & 1-\tau_{n-1} & 1-\tau_n \end{pmatrix}.$$

Clearly, for $\tau_{n-1} \neq \tau_n$, $S$ is invertible. As in Ito et al. [1992], consider then the
following similarity transformation of the transition matrix $A$,

$$A_H(\tau_{n-1}, \tau_n) = S(\tau_{n-1}, \tau_n)^{-1} A\, S(\tau_{n-1}, \tau_n). \tag{9}$$

It is easy to verify that $\mathbf{1}_n^\top A_H = \mathbf{1}_n^\top$. However, $A_H$ is not necessarily stochastic, as
depending on $\tau_{n-1}, \tau_n$ it may have negative entries. The following lemma resolves the
equivalence of 2A-HMMs, in terms of this transformation.

Lemma 2. Let $H = (A, \theta, \pi)$ be a minimal 2A-HMM satisfying Assumptions (A1-A2).
Then a minimal HMM $H' = (A', \theta', \pi')$ with $n'$ states is equivalent to $H$ iff $n' = n$
and there exist a permutation matrix $\Pi \in \mathbb{R}^{n \times n}$ and $\tau_{n-1} > \tau_n$ such that $\theta' = \Pi\theta$
and

$$\pi' = \Pi\, S(\tau_{n-1}, \tau_n)^{-1} \pi,$$
$$A' = \Pi\, A_H(\tau_{n-1}, \tau_n)\, \Pi^{-1} \geq 0.$$

The feasible region. By Lemma 2, any matrix $A_H(\tau_{n-1}, \tau_n)$ whose entries are all
non-negative yields an HMM equivalent to the original one. We thus define the feasible
region of $H$ by

$$\Gamma_H = \{(\tau_{n-1}, \tau_n) \in \mathbb{R}^2 \mid A_H(\tau_{n-1}, \tau_n) \geq 0,\ \tau_{n-1} > \tau_n\}. \tag{10}$$

By definition, $\Gamma_H$ is non-empty, since $(\tau_{n-1}, \tau_n) = (1, 0)$ recovers the original matrix
$A$. As we show below, $\Gamma_H$ is determined by three simpler regions $\Gamma_1, \Gamma_2, \Gamma_3 \subset \mathbb{R}^2$.
The region $\Gamma_1$ ensures that all entries of $A_H$ are non-negative except possibly in the
lower right $2 \times 2$ block corresponding to the two aliased states. The regions $\Gamma_2$ and $\Gamma_3$
ensure non-negativity of the latter, depending on whether the aliased relative probabilities
of (4) satisfy $\alpha_{n-1} \geq \alpha_n$ or $\alpha_{n-1} < \alpha_n$, respectively. For ease of exposition we
assume as a convention that $P(\bar n \mid n-1) \geq P(\bar n \mid n)$.
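Membership of a given pair $(\tau_{n-1}, \tau_n)$ in the feasible region can be tested directly from (9) and (10). The sketch below is a brute-force check, not the geometric characterization developed next; the function names `S_matrix`, `A_H`, `is_feasible`, the tolerance and the 4-state matrix are illustrative assumptions.

```python
import numpy as np

def S_matrix(n, tau1, tau2):
    """S(tau_{n-1}, tau_n): identity except for the lower-right 2x2 block."""
    S = np.eye(n)
    S[n - 2:, n - 2:] = [[tau1, tau2], [1.0 - tau1, 1.0 - tau2]]
    return S

def A_H(A, tau1, tau2):
    """Similarity transformation (9): S^{-1} A S (requires tau1 != tau2)."""
    S = S_matrix(A.shape[0], tau1, tau2)
    return np.linalg.solve(S, A @ S)

def is_feasible(A, tau1, tau2, tol=1e-12):
    """Membership test for the feasible region (10)."""
    return tau1 > tau2 and bool(np.all(A_H(A, tau1, tau2) >= -tol))

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])

original = A_H(A, 1.0, 0.0)        # (1, 0) gives back A itself
shifted = A_H(A, 2.0, 0.0)         # columns still sum to one, but not all entries >= 0
```

For this matrix the pair $(2, 0)$ is infeasible: the entry in row 1 of the column of state $n-1$ becomes $(1 - 2) \cdot 0.8 < 0$.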

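Theorem 2. Let $H$ be a minimal 2A-HMM satisfying Assumptions (A1-A2). There
exist $(\tau^{\min}_{n-1}, \tau^{\max}_{n-1}, \tau^{\min}_n, \tau^{\max}_n) \in \mathbb{R}^4$, $(\tau^-, \tau^+) \in \mathbb{R}^2$ and convex monotonic decreasing
functions $f, g : \mathbb{R} \to \mathbb{R}$ such that

$$\Gamma_H = \begin{cases} \Gamma_1 \cap \Gamma_2 & \alpha_{n-1} \geq \alpha_n \\ \Gamma_1 \cap \Gamma_3 & \alpha_{n-1} < \alpha_n, \end{cases}$$

where the regions $\Gamma_1, \Gamma_2, \Gamma_3 \subset \mathbb{R}^2$ are given by

$$\Gamma_1 = [\tau^{\min}_{n-1}, \tau^{\max}_{n-1}] \times [\tau^{\min}_n, \tau^{\max}_n]$$

$$\Gamma_2 = [\tau^+, \infty) \times [\tau^-, \tau^+]$$

$$\Gamma_3 = \{(\tau_{n-1}, \tau_n) \in \Gamma_1 \mid f(\tau_{n-1}) \leq \tau_n \leq g(\tau_{n-1})\}.$$

In addition, the set $\Gamma_H$ is connected.

The feasible regions in the two possible cases ($\alpha_{n-1} \geq \alpha_n$ or $\alpha_{n-1} < \alpha_n$) are depicted
in Appendix C, Fig. 6.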


Strict Identifiability. By Lemma 2, for strict identifiability of $H$, $\Gamma_H$ should be the
singleton set $\Gamma_H = \{(1, 0)\}$. Due to lack of space, sufficient and necessary conditions
for this to hold, as well as a corresponding simple procedure to determine whether a
2A-HMM is identifiable, are given in Appendix C.2.

Remark. While beyond the scope of this paper, we note that instead of strict identifiability
of a given HMM, several works studied a different concept of generic identifiability
[Allman et al., 2009], proving that under mild conditions the class of HMMs is
generically identifiable. In contrast, if we restrict ourselves to the class of 2A-HMMs,
then our Theorem 2 implies that this class is generically non-identifiable. The reason is
that by Theorem 2, for any 2A-HMM whose matrix $A$ has all its entries positive, there
are an infinite number of equivalent 2A-HMMs, implying non-identifiability.


5 Learning a 2A-HMM 

Let $(Y_t)_{t=0}^{T-1}$ be an output sequence generated by a parametric-output HMM that satisfies
Assumptions (A1-A2) and initialized with its stationary distribution, $X_0 \sim \pi$.
We assume the HMM is either non-aliasing, with $n-1$ states, or 2-aliasing with $n$
states. We further assume that the HMM is minimal and identifiable, as otherwise its
parameters cannot be uniquely determined.

In this section we study the problems of detecting whether the HMM is aliasing and
recovering its output parameters $\theta$ and transition matrix $A$, all in terms of $(Y_t)_{t=0}^{T-1}$.

High level description. The proposed learning procedure consists of the following
steps (see Fig. 1):

(i) Determine the number of output components $n-1$ and estimate the $n-1$ unique
output distribution parameters $\bar\theta$ and the projected stationary distribution $\bar\pi$.

(ii) Detect if the HMM is 2-aliasing.

(iii) In case of a non-aliased HMM, estimate the $(n-1) \times (n-1)$ transition matrix
$\bar A$, as for example in Kontorovich et al. [2013] or Anandkumar et al. [2012].

(iv) In case of a 2-aliased HMM, identify the component $\bar\theta_{n-1}$ corresponding to the
two aliased states, and estimate the $n \times n$ transition matrix $A$.

Figure 1: Learning a 2A-HMM.


We now describe in detail each of these steps. As far as we know, our learning procedure
is the first to consistently learn a 2A-HMM in a computationally efficient way. In
particular, the solutions for problems (ii) and (iv) are new.

Estimating the output distribution parameters. As the HMM is stationary, each
observable $Y_t$ is a random realization from the following parametric mixture model,

$$Y \sim \sum_{i=1}^{n-1} \bar\pi_i f_{\bar\theta_i}(y). \tag{11}$$

Hence, the number of unique output components $n-1$, the corresponding output parameters
$\bar\theta$ and the projected stationary distribution $\bar\pi$ can be estimated by fitting a
mixture model (11) to the observed output sequence $(Y_t)_{t=0}^{T-1}$.

Consistent methods to determine the number of components in a mixture are well
known in the literature [Titterington et al., 1985]. The estimation of $\bar\theta$ and $\bar\pi$ is typically
done either by applying an EM algorithm, or by any recently developed spectral method
[Dasgupta, 1999, Achlioptas & McSherry, 2005, Anandkumar et al., 2012]. As our
focus is on the aliasing aspects of the HMM, in what follows we assume that the number
of unique output components $n-1$, the output parameters $\bar\theta$ and the projected stationary
distribution $\bar\pi$ are exactly known. As in Kontorovich et al. [2013], it is possible to show
that our method is robust to small perturbations in these quantities (not presented).

5.1 Moments 

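To solve problems (ii), (iii) and (iv) above, we first introduce the moment-based quantities
we shall make use of. Given $\bar\theta$ and $\bar\pi$ or estimates of them, for any $i, j \in [n-1]$,
we define the second order moments with time lag $t$ by

$$\mathcal{M}^{(t)}_{ij} = \mathbb{E}[f_{\bar\theta_j}(Y_0)\, f_{\bar\theta_i}(Y_t)], \qquad t \in \{1, 2, 3\}. \tag{12}$$

The consecutive in time third order moments are defined by

$$\mathcal{G}^{(c)}_{ij} = \mathbb{E}[f_{\bar\theta_j}(Y_0)\, f_{\bar\theta_c}(Y_1)\, f_{\bar\theta_i}(Y_2)], \qquad \forall c \in [n-1]. \tag{13}$$

Here the row index $i$ is attached to the later observation, which is the convention under
which the matrix forms below hold for $A_{ij} = P(i \mid j)$. We also define the lifted kernel,
$\mathcal{K} = B^\top \bar K B \in \mathbb{R}^{n \times n}$. One can easily verify that for a 2A-HMM,

$$\mathcal{M}^{(t)} = \bar K B A^t C_\beta\, \mathrm{diag}(\bar\pi)\, \bar K \tag{14}$$

$$\mathcal{G}^{(c)} = \bar K B A\, \mathrm{diag}(\mathcal{K}_{[\cdot,c]})\, A C_\beta\, \mathrm{diag}(\bar\pi)\, \bar K. \tag{15}$$

Next we define the kernel free moments $M^{(t)}, G^{(c)} \in \mathbb{R}^{(n-1) \times (n-1)}$ as follows:

$$M^{(t)} = \bar K^{-1} \mathcal{M}^{(t)} \bar K^{-1} \mathrm{diag}(\bar\pi)^{-1} \tag{16}$$

$$G^{(c)} = \bar K^{-1} \mathcal{G}^{(c)} \bar K^{-1} \mathrm{diag}(\bar\pi)^{-1}. \tag{17}$$

Note that by Assumption (A1), the kernel $\bar K$ is full rank and thus $\bar K^{-1}$ exists. Similarly,
by (A2) $\bar\pi > 0$, so $\mathrm{diag}(\bar\pi)^{-1}$ also exists. Thus, (16,17) are well defined.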

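Let $R^{(2)}, R^{(3)}, F^{(c)} \in \mathbb{R}^{(n-1) \times (n-1)}$ be given by

$$R^{(2)} = M^{(2)} - (M^{(1)})^2 \tag{18}$$

$$R^{(3)} = M^{(3)} - M^{(2)} M^{(1)} - M^{(1)} M^{(2)} + (M^{(1)})^3 \tag{19}$$

$$F^{(c)} = G^{(c)} - M^{(1)}\, \mathrm{diag}(\bar K_{[\cdot,c]})\, M^{(1)}. \tag{20}$$


The following key lemma relates the moments (18, 19, 20) to the decomposition (8) of
the transition matrix $A$.

Lemma 3. Let $H$ be a minimal 2A-HMM with aliased states $n-1$ and $n$. Let $\bar A$, $\delta^{out}$,
$\delta^{in}$ and $\kappa$ be defined in (3,5,6,7) respectively. Then the following relations hold:

$$M^{(1)} = \bar A \tag{21}$$

$$R^{(2)} = \delta^{out} (\delta^{in})^\top \tag{22}$$

$$R^{(3)} = \kappa R^{(2)} \tag{23}$$

$$F^{(c)} = \bar K_{n-1,c}\, R^{(2)}, \qquad \forall c \in [n-1]. \tag{24}$$


In the following, these relations will be used to detect aliasing, identify the aliased
states and recover the aliased transition matrix $A$.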
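The relations of Lemma 3 can be checked at the population level. Substituting (14) into (16) cancels the kernel and $\mathrm{diag}(\bar\pi)$ factors and leaves the kernel-free moment $B A^t C_\beta$, which the sketch below uses directly. The 4-state matrix, the helper `stationary` and the variable names are illustrative; $\kappa$ is obtained here as the least-squares ratio between $R^{(3)}$ and $R^{(2)}$ rather than from (7).

```python
import numpy as np

def stationary(A):
    """Stationary distribution of a column-stochastic matrix (A @ pi = pi)."""
    w, V = np.linalg.eig(A)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]
pi = stationary(A)
beta = pi[n - 2] / (pi[n - 2] + pi[n - 1])

B = np.zeros((n - 1, n))
B[:n - 2, :n - 2] = np.eye(n - 2)
B[n - 2, n - 2:] = 1.0
C = np.zeros((n, n - 1))
C[:n - 2, :n - 2] = np.eye(n - 2)
C[n - 2, n - 2], C[n - 1, n - 2] = beta, 1.0 - beta

def kernel_free_moment(t):
    """Population kernel-free lag-t moment, B A^t C_beta."""
    return B @ np.linalg.matrix_power(A, t) @ C

M1, M2, M3 = (kernel_free_moment(t) for t in (1, 2, 3))
R2 = M2 - M1 @ M1                                   # eq. (18)
R3 = M3 - M2 @ M1 - M1 @ M2 + M1 @ M1 @ M1          # eq. (19)

singular_values = np.linalg.svd(R2, compute_uv=False)
kappa = np.sum(R3 * R2) / np.sum(R2 * R2)           # ratio in eq. (23)
```

For this aliased chain `R2` has a single non-negligible singular value and `R3` is a scalar multiple of `R2`, in line with (22) and (23).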

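Empirical moments. In practice, the unknown moments (12,13) are estimated from
the output sequence $(Y_t)_{t=0}^{T-1}$ by

$$\hat{\mathcal{M}}^{(t)}_{ij} = \frac{1}{T-t} \sum_{l=0}^{T-t-1} f_{\bar\theta_j}(Y_l)\, f_{\bar\theta_i}(Y_{l+t}),$$

$$\hat{\mathcal{G}}^{(c)}_{ij} = \frac{1}{T-2} \sum_{l=0}^{T-3} f_{\bar\theta_j}(Y_l)\, f_{\bar\theta_c}(Y_{l+1})\, f_{\bar\theta_i}(Y_{l+2}).$$

With $\bar K$, $\bar\pi$ known, the corresponding empirical kernel free moments are given by

$$\hat M^{(t)} = \bar K^{-1} \hat{\mathcal{M}}^{(t)} \bar K^{-1} \mathrm{diag}(\bar\pi)^{-1} \tag{25}$$

$$\hat G^{(c)} = \bar K^{-1} \hat{\mathcal{G}}^{(c)} \bar K^{-1} \mathrm{diag}(\bar\pi)^{-1}. \tag{26}$$

The empirical estimates for (18,19,20) similarly follow.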
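The empirical moments amount to a single pass over the data. The sketch below assumes the densities have already been evaluated into an array `Fvals` with `Fvals[l, i]` $= f_{\bar\theta_i}(Y_l)$, and it attaches the row index to the later observation, the convention under which $M^{(1)} = \bar A$ holds for a column-stochastic $A$. All function names are illustrative.

```python
import numpy as np

def second_moment(Fvals, t):
    """M_hat[i, j] = average over l of Fvals[l + t, i] * Fvals[l, j]."""
    T = Fvals.shape[0]
    return Fvals[t:].T @ Fvals[:T - t] / (T - t)

def third_moment(Fvals, c):
    """G_hat[i, j] = average over l of Fvals[l+2, i] * Fvals[l+1, c] * Fvals[l, j]."""
    T = Fvals.shape[0]
    middle = Fvals[1:T - 1, c][:, None]        # f_c evaluated at Y_{l+1}
    return (Fvals[2:] * middle).T @ Fvals[:T - 2] / (T - 2)

def kernel_free(moment, K_bar, pi_bar):
    """Eqs. (25)-(26): K^{-1} (moment) K^{-1} diag(pi_bar)^{-1}."""
    K_inv = np.linalg.inv(K_bar)
    return K_inv @ moment @ K_inv / np.asarray(pi_bar)[None, :]

# Tiny hand-checkable input: three time steps, two components.
Fvals = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])
M1_hat = second_moment(Fvals, 1)
G1_hat = third_moment(Fvals, 1)
M1_free = kernel_free(M1_hat, np.eye(2), [0.5, 0.5])
```

The estimates of $R^{(2)}$, $R^{(3)}$ and $F^{(c)}$ then follow by plugging these matrices into (18)-(20).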

To analyze the error between the empirical and population quantities, we make the
following additional assumption:

(A3) The output distributions are bounded. Namely there exists $L > 0$ such that
$\forall i \in [n]$ and $\forall y \in \mathcal{Y}$, $f_{\theta_i}(y) \leq L$.

Lemma 4. Let $(Y_t)_{t=0}^{T-1}$ be an output sequence generated by an HMM satisfying
Assumptions (A1-A3). Then, as $T \to \infty$, for any $t \in \{1, 2, 3\}$ and $c \in [n-1]$, all error
terms $\hat M^{(t)} - M^{(t)}$, $\hat G^{(c)} - G^{(c)}$, $\hat R^{(t)} - R^{(t)}$ and $\hat F^{(c)} - F^{(c)}$ are $O_P(T^{-1/2})$.

In fact, due to strong mixing, all of the above quantities are asymptotically normally
distributed [Bradley, 2005].

5.2 Detection of aliasing 

We now proceed to detect if the HMM is aliased (step (ii) in Fig. 1). We pose this as a
hypothesis testing problem:

$\mathcal{H}_0$ : $H$ is non-aliased with $n-1$ states
vs.
$\mathcal{H}_1$ : $H$ is 2-aliased with $n$ states.

We begin with the following simple observation:

Lemma 5. Let $H$ be a minimal non-aliased HMM with $n-1$ states, satisfying Assumptions
(A1-A3). Then $R^{(2)} = 0$.

In contrast, if $H$ is 2-aliasing then according to (22) we have $R^{(2)} = \delta^{out} (\delta^{in})^\top$.
In addition, since the HMM is assumed to be minimal and started from the stationary
distribution, Theorem 1 implies that both $\delta^{out} \neq 0$ and $\delta^{in} \neq 0$. Thus $R^{(2)}$ is exactly
a rank-1 matrix, which we write as

$$R^{(2)} = \sigma u v^\top \quad\text{with}\quad \|u\|_2 = \|v\|_2 = 1,\ \sigma > 0, \tag{27}$$

where $\sigma$ is the unique non-zero singular value of $R^{(2)}$. Hence, our hypothesis testing
problem takes the form:

$\mathcal{H}_0 : R^{(2)} = 0$ vs. $\mathcal{H}_1 : R^{(2)} = \sigma u v^\top$ with $\sigma > 0$.

In practice, we only have the empirical estimate $\hat R^{(2)}$. Even if $\sigma = 0$, this matrix is
typically full rank with $n-1$ non-zero singular values. Our problem is thus detecting
the rank of a matrix from a noisy version of it. There are multiple methods to do so.
In this paper, motivated by Kritchman & Nadler [2009], we adopt the largest singular
value $\hat\sigma_1$ of $\hat R^{(2)}$ as our test statistic. The resulting test is

$$\text{if } \hat\sigma_1 > h_T \text{ return } \mathcal{H}_1, \text{ otherwise return } \mathcal{H}_0, \tag{28}$$

where $h_T$ is a predefined threshold. By Lemma 4, as $T \to \infty$ the singular values of
$\hat R^{(2)}$ converge to those of $R^{(2)}$. Thus, as the following lemma shows, with a suitable
threshold this test is asymptotically consistent.

Lemma 6. Let $H$ be a minimal HMM satisfying Assumptions (A1-A3) which is either
non-aliased or 2-aliased. Then for any $0 < \epsilon < \frac{1}{2}$ the test (28) with $h_T = \Omega(T^{-\frac{1}{2}+\epsilon})$ is
consistent: as $T \to \infty$, with probability one, it will correctly detect whether the HMM
is non-aliased or 2-aliased.
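The test (28) is a one-line comparison once $\hat R^{(2)}$ is available. The sketch below assumes a threshold of the form $h_T = c\, T^{-1/2+\epsilon}$; the defaults $c = 2$ and $\epsilon = 1/6$ give $h_T = 2T^{-1/3}$ and are an illustrative choice, as is the function name `detect_aliasing`.

```python
import numpy as np

def detect_aliasing(R2_hat, T, c=2.0, eps=1.0 / 6.0):
    """Test (28): report 2-aliasing when sigma_1(R2_hat) exceeds h_T.

    Returns (decision, sigma_1, h_T), where decision is True for H_1.
    """
    sigma1 = float(np.linalg.svd(R2_hat, compute_uv=False)[0])
    h_T = c * T ** (-0.5 + eps)
    return sigma1 > h_T, sigma1, h_T

# Two noiseless inputs: a zero matrix (H_0) and a rank-1 matrix (H_1).
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
decision_null, _, _ = detect_aliasing(np.zeros((3, 3)), T=10**6)
decision_alt, sigma1_alt, h_alt = detect_aliasing(0.3 * np.outer(u, v), T=10**6)
```

With $T = 10^6$ the threshold is $0.02$, so a singular value of $0.3$ is flagged while an exactly zero matrix is not.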


Estimating the non-aliased transition matrix $\bar A$. If the HMM was detected as non-aliasing,
then its $(n-1) \times (n-1)$ transition matrix $\bar A$ can be estimated for example by
the spectral methods given in Kontorovich et al. [2013] or Anandkumar et al. [2012].
It is shown there that these methods are (strongly) consistent. Moreover, as $T \to \infty$,

$$\hat{\bar A} = \bar A + O_P(T^{-1/2}). \tag{29}$$

5.3 Identifying the aliased component $\bar\theta_{n-1}$

Assuming the HMM was detected as 2-aliasing, our next task, step (iv), is to identify
the aliased component. Recall that if the aliased component is $\bar\theta_{n-1}$, then by (24)

$$F^{(c)} = \bar K_{n-1,c}\, R^{(2)}, \qquad \forall c \in [n-1].$$

We thus estimate the index $i \in [n-1]$ of the aliased component by solving the following
least squares problem:

$$\hat i = \operatorname*{argmin}_{i \in [n-1]} \sum_{c \in [n-1]} \left\| \hat F^{(c)} - \bar K_{i,c}\, \hat R^{(2)} \right\|_F^2. \tag{30}$$

The following result shows this method is consistent.

Lemma 7. For a minimal 2A-HMM satisfying Assumptions (A1-A3) with aliased states
$n-1$ and $n$,

$$\lim_{T \to \infty} \Pr(\hat i \neq n-1) = 0.$$
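The search in (30) is over only $n-1$ candidates, so it can be done by enumeration. The sketch below runs it on population quantities for an illustrative 4-state chain with unit-variance Gaussian outputs (means 3, 6 and 0, the last one shared by the two aliased states). The kernel-free moments are computed from their reduced forms $B A C_\beta$ and $B A\, \mathrm{diag}(\mathcal{K}_{[\cdot,c]})\, A C_\beta$; all names and values are assumptions made for the sketch.

```python
import numpy as np

def identify_aliased(F_list, R2, K_bar):
    """Eq. (30): index i minimizing sum_c ||F^(c) - K_bar[i, c] * R2||_F^2."""
    m = K_bar.shape[0]
    scores = [sum(np.sum((F_list[c] - K_bar[i, c] * R2) ** 2) for c in range(m))
              for i in range(m)]
    return int(np.argmin(scores)), scores

def stationary(A):
    """Stationary distribution of a column-stochastic matrix (A @ pi = pi)."""
    w, V = np.linalg.eig(A)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]
pi = stationary(A)
beta = pi[n - 2] / (pi[n - 2] + pi[n - 1])
B = np.zeros((n - 1, n))
B[:n - 2, :n - 2] = np.eye(n - 2)
B[n - 2, n - 2:] = 1.0
C = np.zeros((n, n - 1))
C[:n - 2, :n - 2] = np.eye(n - 2)
C[n - 2, n - 2], C[n - 1, n - 2] = beta, 1.0 - beta

# Aliased kernel for unit-variance Gaussians and its lifted version.
mu = np.array([3.0, 6.0, 0.0])
K_bar = np.exp(-(mu[:, None] - mu[None, :]) ** 2 / 4.0) / (2.0 * np.sqrt(np.pi))
K_lift = B.T @ K_bar @ B

M1 = B @ A @ C
R2 = B @ A @ A @ C - M1 @ M1
F_list = [B @ A @ np.diag(K_lift[:, c]) @ A @ C - M1 @ np.diag(K_bar[:, c]) @ M1
          for c in range(n - 1)]                     # eq. (20)

i_hat, scores = identify_aliased(F_list, R2, K_bar)
```

With exact moments the minimizer is the last component (index 2 in zero-based numbering), and its score is zero up to rounding.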


5.4 Learning the aliased transition matrix A 

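Given the aliased component, we estimate the $n \times n$ transition matrix $A$ using the
decomposition (8). First, recall that by (22), $R^{(2)} = \delta^{out} (\delta^{in})^\top = \sigma u v^\top$. As singular
vectors are determined only up to scaling, we have that

$$\delta^{out} = \gamma u \quad\text{and}\quad \delta^{in} = \frac{\sigma}{\gamma} v,$$

where $\gamma \in \mathbb{R}$ is a yet undetermined constant. Thus, the decomposition (8) of $A$ takes
the form:

$$A = C_\beta \bar A B + \gamma\, C_\beta u c_\beta^\top + \frac{\sigma}{\gamma}\, b v^\top B + \kappa\, b c_\beta^\top. \tag{31}$$

Given that $\bar A$, $\sigma$, $u$ and $v$ are known from previous steps, we are left to determine the
scalars $\gamma$, $\beta$ and $\kappa$ of Eq. (7).

As for $\kappa$, according to (23) we have $R^{(3)} = \kappa R^{(2)}$. Thus, plugging in the empirical
versions, $\kappa$ is estimated by

$$\hat\kappa = \operatorname*{argmin}_{r \in \mathbb{R}} \left\| \hat R^{(3)} - r \hat R^{(2)} \right\|_F^2. \tag{32}$$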
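The one-dimensional least squares problem (32) has the usual closed-form solution, the Frobenius inner product of the two matrices divided by the squared Frobenius norm of $\hat R^{(2)}$. A minimal sketch, with an illustrative function name and toy inputs:

```python
import numpy as np

def estimate_kappa(R3_hat, R2_hat):
    """Closed-form minimizer of ||R3_hat - r * R2_hat||_F^2 over real r."""
    return float(np.sum(R3_hat * R2_hat) / np.sum(R2_hat * R2_hat))

R2_toy = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
# A perturbation orthogonal to R2_toy in the Frobenius inner product.
noise = np.array([[2.0, -1.0],
                  [0.0, 0.0]])
kappa_exact = estimate_kappa(0.4 * R2_toy, R2_toy)
kappa_noisy = estimate_kappa(0.4 * R2_toy + noise, R2_toy)
```

Components of the estimation error that are orthogonal to $\hat R^{(2)}$ do not affect the estimate, which is why both calls return the same value here.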


To determine $\gamma$ and $\beta$ we turn to the similarity transformation $A_H(\tau_{n-1}, \tau_n)$, given
in (9). As shown in Section 4, this transformation characterizes all transition matrices
equivalent to $A$. To relate $A_H$ to the form of the decomposition (31), we reparametrize
$\tau_{n-1}$ and $\tau_n$ as follows:

$$\gamma' = \gamma(\tau_{n-1} - \tau_n), \qquad \beta' = \frac{\beta - \tau_n}{\tau_{n-1} - \tau_n}.$$

Replacing $\tau_{n-1}, \tau_n$ with $\gamma', \beta'$ we find that $A_H$ is given by

$$A_H = C_{\beta'} \bar A B + \gamma'\, C_{\beta'} u c_{\beta'}^\top + \frac{\sigma}{\gamma'}\, b v^\top B + \kappa\, b c_{\beta'}^\top. \tag{33}$$

Note that putting $\gamma' = \gamma$ and $\beta' = \beta$ recovers the decomposition (31) for the original
transition matrix $A$.

Now, since $H$ is assumed identifiable, the constraint $A_H(\tau_{n-1}, \tau_n) \geq 0$ has the
unique solution $(\tau_{n-1}, \tau_n) = (1, 0)$, or equivalently $(\gamma', \beta') = (\gamma, \beta)$. Thus, with
exact knowledge of the various moments, only a single pair of values $(\gamma', \beta')$ will yield
a non-negative matrix (33). This perfectly recovers $\gamma$, $\beta$ and the original transition
matrix $A$.
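The role of $(\gamma', \beta')$ can be illustrated with exact moments. The sketch below takes the SVD of the population $R^{(2)}$, evaluates the right-hand side of (33) for a candidate pair, and checks that the true pair reproduces $A$ while a shifted $\beta'$ produces a negative entry. It does not implement the search of Appendix D.5; the true $\gamma$ and $\kappa$ are computed from the known $A$ purely to have reference values, and all names and the 4-state matrix are illustrative.

```python
import numpy as np

def stationary(A):
    """Stationary distribution of a column-stochastic matrix (A @ pi = pi)."""
    w, V = np.linalg.eig(A)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]
pi = stationary(A)
beta = pi[n - 2] / (pi[n - 2] + pi[n - 1])

B = np.zeros((n - 1, n))
B[:n - 2, :n - 2] = np.eye(n - 2)
B[n - 2, n - 2:] = 1.0

def lifting(beta_val):
    """C_beta for a given value of beta."""
    C = np.zeros((n, n - 1))
    C[:n - 2, :n - 2] = np.eye(n - 2)
    C[n - 2, n - 2], C[n - 1, n - 2] = beta_val, 1.0 - beta_val
    return C

A_bar = B @ A @ lifting(beta)
R2 = B @ A @ A @ lifting(beta) - A_bar @ A_bar
U, s, Vt = np.linalg.svd(R2)
sigma, u, v = s[0], U[:, 0], Vt[0]

def build_A(gamma_p, beta_p, kappa):
    """Right-hand side of (33) for a candidate pair (gamma', beta')."""
    Cp = lifting(beta_p)
    c = np.zeros(n)
    c[n - 2:] = [1.0 - beta_p, -beta_p]
    b = np.zeros(n)
    b[n - 2:] = [1.0, -1.0]
    return (Cp @ A_bar @ B + gamma_p * np.outer(Cp @ u, c)
            + (sigma / gamma_p) * np.outer(b, v @ B) + kappa * np.outer(b, c))

# Reference values computed from the known A (for the check only).
P3, P4 = A[n - 2:, n - 2].sum(), A[n - 2:, n - 1].sum()   # P(n_bar|n-1), P(n_bar|n)
d_out = np.append(A[:n - 2, n - 2] - A[:n - 2, n - 1], P3 - P4)   # eq. (5)
gamma = u @ d_out                                          # delta_out = gamma * u
kappa = (A[n - 2, n - 2] - beta * P3) - (A[n - 2, n - 1] - beta * P4)  # eq. (7)

A_true_pair = build_A(gamma, beta, kappa)
A_shifted = build_A(gamma, beta + 0.05, kappa)
```

Note that `gamma` may come out negative, which simply absorbs the sign ambiguity of the singular vectors.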

In practice we plug into (33) the empirical versions $\hat{\bar A}$, $\hat\kappa$, $\hat\sigma_1$, $\hat u_1$ and $\hat v_1$, where $\hat u_1$,
$\hat v_1$ are the left and right singular vectors of $\hat R^{(2)}$, corresponding to the singular value $\hat\sigma_1$.
As described in Appendix D.5, the values $(\hat\gamma, \hat\beta)$ are found by maximizing a simple two
dimensional smooth function. The resulting estimate for the aliased transition matrix
is

$$\hat A = C_{\hat\beta} \hat{\bar A} B + \hat\gamma\, C_{\hat\beta} \hat u_1 c_{\hat\beta}^\top + \frac{\hat\sigma_1}{\hat\gamma}\, b \hat v_1^\top B + \hat\kappa\, b c_{\hat\beta}^\top.$$

The following theorem proves our method is consistent.

Theorem 3. Let $H$ be a 2A-HMM satisfying Assumptions (A1-A3) with aliased states
$n-1$ and $n$. Then as $T \to \infty$,

$$\hat A = A + o_P(1).$$


6 Numerical simulations 

We present simulation results, illustrating the consistency of our methods for the
detection of aliasing, identifying the aliased component and learning the transition matrix
$A$. As our focus is on the aliasing, we assume for simplicity that the output parameters
$\bar\theta$ and the projected stationary distribution $\bar\pi$ are exactly known.

Figure 2: The aliased HMM (left) and its corresponding non-aliased version with states
3 and 4 merged (right).

Motivated by applications in modeling of ion channel gating [Crouzy & Sigworth,
1990, Rosales et al., 2001, Witkoskie & Cao, 2004], we consider the following HMM
$H$ with $n = 4$ hidden states (see Fig. 2, left). The output distributions are univariate
Gaussians $\mathcal{N}(\mu_i, \sigma_i^2)$. Its matrix $A$ and $f_\theta$ are given by

$$A = \begin{pmatrix} 0.3 & 0.25 & 0.0 & 0.8 \\ 0.6 & 0.25 & 0.2 & 0.0 \\ 0.0 & 0.5 & 0.1 & 0.1 \\ 0.1 & 0.0 & 0.7 & 0.1 \end{pmatrix}, \qquad \begin{aligned} f_{\theta_1} &= \mathcal{N}(3, 1) \\ f_{\theta_2} &= \mathcal{N}(6, 1) \\ f_{\theta_3} &= \mathcal{N}(0, 1) \\ f_{\theta_4} &= \mathcal{N}(0, 1). \end{aligned}$$

States 3 and 4 are aliased and by Procedure 1 in Appendix C.3 this 2A-HMM is identifiable.
The rank-1 matrix $R^{(2)}$ has a singular value $\sigma = 0.33$. Fig. 2 (right) shows its
non-aliased version $\bar H$ with states 3 and 4 merged.

To illustrate the ability of our algorithm to detect aliasing, we generated $T$ outputs
from the original aliased HMM and from its non-aliased version $\bar H$. Fig. 3 (left)
shows the empirical densities (averaged over 1000 independent runs) of the largest singular
value of $\hat R^{(2)}$, for both $H$ and $\bar H$. In Fig. 3 (right) we show similar results for
a 2A-HMM with $\sigma = 0.22$. When $\sigma = 0.33$, already $T = 1000$ outputs suffice for
essentially perfect detection of aliasing. For the smaller $\sigma = 0.22$, more samples are
required.

Fig. 4 (left) shows the false alarm and misdetection probability vs. sample size $T$
of the aliasing detection test (28) with threshold $h_T = 2T^{-1/3}$. The consistency of our
method is evident.

Fig. 4 (right) shows the probability of misidentifying the aliased component $\bar\theta_3$. We
considered the same 2A-HMM $H$ but with different means for the Gaussian output
distribution of the aliased states, $\mu_3 \in \{0, 1, 2\}$. As expected, when $f_{\bar\theta_3}$ is closer to
the output distribution of the non-aliased state (with mean $\mu_1 = 3$), identifying the
aliased component is more difficult.
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Figure 3; Empirical density of the largest singular value of with a = 0.33 (left) 
and a — 0.22 (right). 




Figure 4; Misdetection probability of aliasing/non-aliasing (left) and probability of 
misidentifying the correct aliased component (right). 



Figure 5; Average error E||A — A||p and runtime comparison of different algorithms 
vs. sample size T. 16 






















Finally, to estimate $A$ we considered the following methods: the Baum-Welch 
algorithm with random initial guess of the HMM parameters (BW); our method of 
moments with exactly known $\theta$ (MoM+Exact); BW initialized with the output of our 
method (BW+MoM+Exact); and BW with exactly known output distributions but 
random initial guess of the transition matrix (BW+Exact). 

Fig.5 (left) shows on a logarithmic scale the mean square error $\mathbb{E}\|\hat{A} - A\|_F^2$ vs. 
sample size $T$, averaged over 100 independent realizations. Fig.5 (right) shows the 
running time as a function of $T$. In these two figures, the number of iterations of the 
BW was set to 20. 

These results show that with a random initial guess of the HMM parameters, BW 
requires far more than 20 iterations to converge. Even with exact knowledge of the 
output distributions but a random initial guess for the transition matrix, BW still fails 
to converge after 20 iterations. In contrast, our method yields a relatively accurate 
estimator in only a fraction of the run-time. For improved accuracy, this estimator can 
further be used as an initial guess for $A$ in the BW algorithm. 

A Proofs for Section 3 (Decomposing A) 


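Proof of Lemma 1. Writing each term in the decomposition (8) explicitly and summing 
these together we find a match between all entries to those of $A$. 

As a representative example let us consider the last entry $A_{n,n} = P(n \mid n)$. The 
first term gives 

$$(C_\beta \bar{A} B)_{[n,n]} = (1 - \beta) P(\bar{n} \mid \bar{n}),$$

where $P(\bar{n} \mid \bar{n}) = \bar{A}_{n-1,n-1} = \beta P(\bar{n} \mid n-1) + (1-\beta) P(\bar{n} \mid n)$ is the self-transition 
probability of the merged state. The second term gives 

$$(C_\beta \delta^{out} c_\beta^\top)_{[n,n]} = -\beta(1-\beta)\big(P(\bar{n} \mid n-1) - P(\bar{n} \mid n)\big).$$

The third term, 

$$(b (\delta^{in})^\top B)_{[n,n]} = -\delta^{in}_{n-1} = -\beta(\alpha_{n-1}-\beta)P(\bar{n} \mid n-1) - (1-\beta)(\alpha_n-\beta)P(\bar{n} \mid n).$$

And lastly, the fourth term gives 

$$(\kappa\, b c_\beta^\top)_{[n,n]} = \beta(\alpha_{n-1}-\beta)P(\bar{n} \mid n-1) - \beta(\alpha_n - \beta)P(\bar{n} \mid n).$$

Putting $P(\bar{n} \mid n) = P(n-1 \mid n) + P(n \mid n)$ and $P(\bar{n} \mid n-1) = P(n-1 \mid n-1) + 
P(n \mid n-1)$, and summing all these four terms we obtain $P(n \mid n)$ as needed. The other 
entries of $A$ are obtained similarly. □ 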
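
The four-term decomposition can also be checked numerically. The sketch below assumes the forms $B$, $C_\beta$, $b = (0,\dots,0,1,-1)^\top$ and $c_\beta = (0,\dots,0,1-\beta,-\beta)^\top$, with $\bar{A} = BAC_\beta$, $\delta^{out} = BAb$, $\delta^{in} = (c_\beta^\top A C_\beta)^\top$ and $\kappa = c_\beta^\top A b$, and applies them to the 4-state example of Section 6. These definitions are taken as assumptions here, since the main-text equations are not part of this appendix.

```python
import numpy as np

# Example transition matrix (column-stochastic: A[i, j] = P(next = i | current = j)).
A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]

# Stationary distribution and the aliased-pair weight beta.
w, V = np.linalg.eig(A)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()
beta = pi[n - 2] / (pi[n - 2] + pi[n - 1])

B = np.zeros((n - 1, n)); B[:, :n - 1] = np.eye(n - 1); B[n - 2, n - 1] = 1.0
C = np.zeros((n, n - 1)); C[:n - 2, :n - 2] = np.eye(n - 2)
C[n - 2, n - 2], C[n - 1, n - 2] = beta, 1.0 - beta
b = np.zeros(n); b[n - 2], b[n - 1] = 1.0, -1.0
c = np.zeros(n); c[n - 2], c[n - 1] = 1.0 - beta, -beta

A_bar = B @ A @ C            # merged transition matrix
d_out = B @ A @ b            # delta^out
d_in = C.T @ A.T @ c         # delta^in = (c^T A C)^T
kappa = c @ A @ b

terms = [C @ A_bar @ B,               # first term
         np.outer(C @ d_out, c),      # second term
         np.outer(b, B.T @ d_in),     # third term
         kappa * np.outer(b, c)]      # fourth term
A_rebuilt = sum(terms)
print(np.max(np.abs(A_rebuilt - A)))
```

The identity is purely algebraic: it follows from $I_n - C_\beta B = b c_\beta^\top$, so it holds for any $\beta$ and not only for the stationary one used here.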


B Proofs for Section 4.1 (Minimality) 

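Let $H = (A, \theta, \pi^0)$ be a 2A-HMM. For any $k \ge 1$ the distribution $P_{H,k} \in \mathcal{P}_H$ can be 
cast in an explicit matrix form. Let $o \in \mathcal{Y}$. The observable operator $T_\theta(o) \in \mathbb{R}^{n \times n}$ is 
defined by 

$$T_\theta(o) = \operatorname{diag}\big(f_{\theta_1}(o), f_{\theta_2}(o), \ldots, f_{\theta_n}(o)\big).$$

Let $y = (y_0, y_1, \ldots, y_{k-1}) \in \mathcal{Y}^k$ be a sequence of $k \ge 1$ initial consecutive 
observations. Then the distribution $P_{H,k}(y)$ is given by Jaeger [2000], 

$$P_{H,k}(y) = 1_n^\top T_\theta(y_{k-1}) A \cdots A\, T_\theta(y_1) A\, T_\theta(y_0) \pi^0. \tag{34}$$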
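
The matrix form (34) translates directly into a forward recursion. The sketch below assumes unit-variance Gaussian outputs as in the example of Section 6 and a uniform initial distribution; the names `gaussian_pdf` and `sequence_density` are illustrative.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma=1.0):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def sequence_density(A, mus, pi0, ys):
    """Evaluate 1^T T(y_{k-1}) A ... A T(y_0) pi0 for unit-variance Gaussian outputs."""
    state = gaussian_pdf(ys[0], mus) * pi0           # T(y_0) pi0: diagonal operator = elementwise product
    for y in ys[1:]:
        state = gaussian_pdf(y, mus) * (A @ state)   # apply A, then the next diagonal operator
    return float(state.sum())                        # left-multiply by the all-ones row vector

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
mus = np.array([3.0, 6.0, 0.0, 0.0])   # states 3 and 4 share the same output mean
pi0 = np.full(4, 0.25)                 # assumed initial distribution

p1 = sequence_density(A, mus, pi0, [0.5])
p2 = sequence_density(A, mus, pi0, [0.5, 2.0])
print(p1, p2)
```

For $k = 1$ this reduces to the mixture density $\sum_i \pi^0_i f_{\theta_i}(y_0)$, and for $k = 2$ it agrees with the explicit double sum over state pairs.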

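Proof of Theorem 1. Let us first show that $\delta^{out} \neq 0$ is necessary for minimality, namely 
if $\delta^{out} = 0$ then $H$ is not minimal, regardless of the initial distribution $\pi^0$. The 
non-minimality will be shown by explicitly constructing an $(n-1)$-state HMM equivalent to 
$H$. Let us denote the lifting of the merged transition matrix by 

$$\tilde{A} = C_\beta \bar{A} B \in \mathbb{R}^{n \times n}.$$

Assume that $\delta^{out} = 0$. We will shortly see that for any $\pi^0$ and for any $k \ge 2$ 
consecutive observations $y = (y_0, y_1, \ldots, y_{k-1}) \in \mathcal{Y}^k$ we have that 

$$T_\theta(y_{k-1}) A \cdots T_\theta(y_1) A\, T_\theta(y_0) \pi^0 - T_\theta(y_{k-1}) \tilde{A} \cdots T_\theta(y_1) \tilde{A}\, T_\theta(y_0) \pi^0 \propto b. \tag{35}$$

Combining (35) with (34), and the fact that $1_n^\top b = 0$, we have that $\mathcal{P}_{(A,\theta,\pi^0)} = 
\mathcal{P}_{(\tilde{A},\theta,\pi^0)}$. Since $\tilde{A}$ has identical $(n-1)$-th and $n$-th columns, and $f_{\theta_{n-1}} = f_{\theta_n}$, we 
have that $\mathcal{P}_{(\tilde{A},\theta,\pi^0)} = \mathcal{P}_{(\bar{A},\bar{\theta},\bar{\pi}^0)}$. Thus $H' = (\bar{A}, \bar{\theta}, \bar{\pi}^0)$ is an equivalent $(n-1)$-state 
HMM and $H$ is not minimal, proving the claim. We prove (35) by induction on the 
sequence length $k \ge 2$. First note that since $\delta^{out} = 0$, by Lemma 1 we have that 

$$A = \tilde{A} + b\big((\delta^{in})^\top B + \kappa c_\beta^\top\big).$$

Since for any $y \in \mathcal{Y}$, $T_\theta(y) b = f_{\theta_{n-1}}(y)\, b$, we have that 

$$T_\theta(y) A - T_\theta(y) \tilde{A} = f_{\theta_{n-1}}(y)\, b\big((\delta^{in})^\top B + \kappa c_\beta^\top\big) \propto b.$$

This proves the case $k = 2$. Next, assume (35) holds for all sequences of length at least 
2 and smaller than $k$, namely, for some $a \in \mathbb{R}$, 

$$T_\theta(y_{k-2}) A \cdots T_\theta(y_1) A\, T_\theta(y_0) \pi^0 = a b + T_\theta(y_{k-2}) \tilde{A} \cdots T_\theta(y_1) \tilde{A}\, T_\theta(y_0) \pi^0.$$

Using the fact that $Bb = 0$ we have $T_\theta(y_{k-1}) \tilde{A} b = 0$. Inserting the expansion of $A$ in 
the l.h.s of (35) we get 

$$f_{\theta_{n-1}}(y_{k-1})\, b \big((\delta^{in})^\top B + \kappa c_\beta^\top\big) \times \big(a b + T_\theta(y_{k-2}) \tilde{A} \cdots T_\theta(y_1) \tilde{A}\, T_\theta(y_0) \pi^0\big).$$

Since this last expression is proportional to $b$ we are done. 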

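(ii) The case $\pi^0_{\bar{n}} = 0$ or $\beta^0 = \beta$. As we just saw, having $\delta^{out} = 0$ implies that the 
HMM is not minimal. We now show that if $\delta^{in} = 0$ then $H$ is not minimal either. By 
contraposition this will prove the first direction of (ii). 

So assume that $\delta^{in} = 0$. Lemma 1 implies 

$$A = \tilde{A} + (C_\beta \delta^{out} + \kappa b) c_\beta^\top. \tag{36}$$

Now note that for all $y \in \mathcal{Y}$, $c_\beta^\top T_\theta(y) = f_{\theta_{n-1}}(y)\, c_\beta^\top$, and since either $\pi^0_{\bar{n}} = 0$ or $\beta^0 = \beta$ 
we have that $c_\beta^\top \pi^0 = 0$. Thus $c_\beta^\top T_\theta(y) \pi^0 = 0$ and we find that 

$$T_\theta(y_{k-1}) A \cdots A\, T_\theta(y_1) A\, T_\theta(y_0) \pi^0 \tag{37}$$

$$= T_\theta(y_{k-1}) A \cdots A\, T_\theta(y_1) \tilde{A}\, T_\theta(y_0) \pi^0. \tag{38}$$

Now since $c_\beta^\top C_\beta = 0$ we have that for any $y \in \mathcal{Y}$, $c_\beta^\top T_\theta(y) \tilde{A} = 0$ and thus expanding 
$A$ by (36) we find that for any $y \in \mathcal{Y}$, 

$$A\, T_\theta(y) \tilde{A} = \big(\tilde{A} + (C_\beta \delta^{out} + \kappa b) c_\beta^\top\big) T_\theta(y) \tilde{A} = \tilde{A}\, T_\theta(y) \tilde{A}.$$

Thus each $A$ in the right hand side of (37) can be replaced by $\tilde{A}$ and we conclude that 
$\mathcal{P}_{(A,\theta,\pi^0)} = \mathcal{P}_{(\tilde{A},\theta,\pi^0)}$. Similarly to the case $\delta^{out} = 0$ we have that $H' = (\bar{A}, \bar{\theta}, \bar{\pi}^0)$ 
is an equivalent $(n-1)$-state HMM and thus $H$ is not minimal. 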

In order to prove the other direction we will show that if $H$ is not minimal then 
either $\delta^{out} = 0$ or $\delta^{in} = 0$. This is equivalent to the condition $\delta^{out} (\delta^{in})^\top = 0$. 

Assuming $H$ is not minimal, there exists an HMM $H'$ with $n' < n$ states such that 
$\mathcal{P}_{H'} = \mathcal{P}_H$. Assumptions (A1-A3) readily imply that $H'$ must have $n' = n-1$ states 
and that the unique $n-1$ output components are identical for $H$ and $H'$. Since $\mathcal{P}_{H'}$ 
is invariant to permutations, we may assume that $\theta' = \bar{\theta}$ and consequently the kernel 
matrices in (1) for both $H$ and $H'$ are equal, $K = K'$. 

Let $A' \in \mathbb{R}^{(n-1) \times (n-1)}$ be the transition matrix of $H'$ and define $H'' = (A'', \theta'', \pi'')$ 
as the equivalent $n$-state HMM to $H'$ by setting $\beta'' = \beta$, $A'' = C_\beta A' B$, $\theta'' = \theta' B$ 
and $\pi'' = C_\beta \pi'$. Note that for $H''$, by construction we have $\delta''^{out} (\delta''^{in})^\top = 0$. 

Now, by the equivalence of the two models $H$ and $H''$, we have that the second 
order moments $M^{(2)}$ given in (12) are the same for both. By the fact that $K'' = K$, 
$\bar{\pi}'' = \bar{\pi}$ and by (22) in Lemma 3 we must have that $\delta^{out} (\delta^{in})^\top = \delta''^{out} (\delta''^{in})^\top$. Thus 
$\delta^{out} (\delta^{in})^\top = 0$ and the claim is proved. 

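(i) The case $\pi^0_{\bar{n}} \neq 0$ and $\beta^0 \neq \beta$. We saw above that if $H$ is minimal then $\delta^{out} \neq 0$. 
Thus, in order to prove the claim we are left to show that if $H$ is not minimal then 
$\delta^{out} = 0$. 

So assume $H$ is not minimal and let $H''$ be constructed as above. By way of 
contradiction assume $\delta^{out} \neq 0$. As we just saw, since $H$ is not minimal then 
$\delta^{out} (\delta^{in})^\top = 0$. Thus by the assumption $\delta^{out} \neq 0$ we must have $\delta^{in} = 0$. This implies that $A$ is in 
the form (36). Since $\mathcal{P}_H = \mathcal{P}_{H''}$ we have $P_{H,2} = P_{H'',2}$ where: 

$$P_{H,2} = 1_n^\top T_\theta(y_2) A\, T_\theta(y_1) \pi^0 = 1_n^\top T_\theta(y_2) \big(\tilde{A} + (C_\beta \delta^{out} + \kappa b) c_\beta^\top\big) T_\theta(y_1) \pi^0,$$

$$P_{H'',2} = 1_n^\top T_\theta(y_2) A''\, T_\theta(y_1) \pi^0.$$

In addition, by the fact that $K'' = K$, $\bar{\pi}'' = \bar{\pi}$ we must have that $M^{(1)} = M''^{(1)}$, 
where $M^{(1)}$ is defined in (16) and $M''^{(1)}$ is defined similarly with the parameters of 
$H''$ instead of $H$. By (21) in Lemma 3 we thus have 

$$\bar{A}'' = A' = \bar{A}.$$

Hence $A'' = \tilde{A}$ and $P_{H,2} = P_{H'',2}$ is equivalent to 

$$1_n^\top T_\theta(y_2) (C_\beta \delta^{out} + \kappa b) c_\beta^\top T_\theta(y_1) \pi^0 = 0. \tag{39}$$

Now, note that $\forall y_1, y_2 \in \mathcal{Y}$ we have 

$$1_n^\top T_\theta(y_2) b = 0,$$

$$c_\beta^\top T_\theta(y_1) \pi^0 = (\beta^0 - \beta) \pi^0_{\bar{n}} f_{\theta_{n-1}}(y_1),$$

$$1_n^\top T_\theta(y_2) C_\beta = \big(f_{\theta_1}(y_2), \ldots, f_{\theta_{n-1}}(y_2)\big).$$

Thus, (39) is given by 

$$(\beta^0 - \beta) \pi^0_{\bar{n}} f_{\theta_{n-1}}(y_1) \big(f_{\theta_1}(y_2), \ldots, f_{\theta_{n-1}}(y_2)\big) \cdot \delta^{out} = 0.$$

Since by assumption $(\beta^0 - \beta) \pi^0_{\bar{n}} \neq 0$ we have $\forall y_1, y_2 \in \mathcal{Y}$, 

$$f_{\theta_{n-1}}(y_1) \big(f_{\theta_1}(y_2), \ldots, f_{\theta_{n-1}}(y_2)\big) \cdot \delta^{out} = 0.$$

For each $i \in [n-1]$, multiplying by $f_{\theta_i}(y_2)$ and integrating over $y_1, y_2 \in \mathcal{Y}$ we get 

$$K \delta^{out} = 0.$$

Since $K$ is full rank we must have $\delta^{out} = 0$ in contradiction to the assumption 
$\delta^{out} \neq 0$. This concludes the proof of the Theorem. 

Figure 6: The feasible region $\Gamma_H$ (shaded) in the $(\tau_{n-1}, \tau_n)$ plane. Any $(\tau_{n-1}, \tau_n) \in 
\Gamma_H$ induces an HMM equivalent to $H$ via Lemma 2. The pair $(\tau_{n-1}, \tau_n) = (1, 0)$, 
corresponding to the original transition matrix $A$, is indicated by a yellow point. Left: 
$\alpha_{n-1} \ge \alpha_n$ and Right: $\alpha_{n-1} < \alpha_n$. 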

C Proofs for Section 4.2 (Identifiability) 

C.1 Proof of Theorem 2 

Before characterizing $\Gamma_H$ let us first give some intuition on the role of $(\tau_{n-1}, \tau_n)$. 
Consider the $(n-1)$-dimensional columns $\{\bar{a}_i \mid i \in [n]\}$ of the matrix $BA$. These can be 
plotted on the $(n-1)$-dimensional simplex, as shown in Fig.7 (top), for $n = 4$ and aliased 
states $\{3, 4\}$. Recall that 

$$A_H(\tau_{n-1}, \tau_n) = S(\tau_{n-1}, \tau_n)^{-1} A\, S(\tau_{n-1}, \tau_n)$$

and let $\{\bar{a}_{H,i} \mid i \in [n]\}$ be the columns of the matrix $BA_H \in \mathbb{R}^{(n-1) \times n}$. 

Figure 7: Top: Plotting the columns of $BA$ on the simplex for a 2A-HMM with aliased 
states $\{3, 4\}$. Here, $\bar{a}_{\bar{3}} = \beta \bar{a}_3 + (1 - \beta) \bar{a}_4$. Bottom: Any vectors $\bar{a}_{H,n-1}$, $\bar{a}_{H,n}$ within 
the depicted bars result in a matrix $A_H$ with all entries non-negative, except possibly in 
the $2 \times 2$ aliased block. 

Since $B S(\tau_{n-1}, \tau_n)^{-1} = B$ we have $BA_H = BA\, S(\tau_{n-1}, \tau_n)$. So the non-aliased 
columns of $BA_H$ are unaltered from those of $BA$, i.e. for all $i \in [n-2]$, $\bar{a}_i = \bar{a}_{H,i}$. 
The new aliased columns of $BA_H$ are 

$$\bar{a}_{H,n-1} = \bar{a}_n + \tau_{n-1} \delta^{out},$$

$$\bar{a}_{H,n} = \bar{a}_n + \tau_n \delta^{out}.$$

Thus $\tau_{n-1}$ ($\tau_n$) determines the position of the vector $\bar{a}_{H,n-1}$ ($\bar{a}_{H,n}$) along the ray 
passing through $\bar{a}_{n-1}$ and $\bar{a}_n$ (dashed line in Fig.7). 

Hence a necessary condition for $A_H$ to be a valid transition matrix is that $\bar{a}_{H,n-1} \ge 0$ 
and $\bar{a}_{H,n} \ge 0$, and one cannot take $\tau_{n-1}$ and $\tau_n$ arbitrarily. In particular, there are 
$\tau^{\max}_{n-1}$ and $\tau^{\max}_n$ such that $\bar{a}_{H,n-1}$ and $\bar{a}_{H,n}$ are as "far" apart as possible by putting them on 
the opposite sides of the ray connecting them, such that both sit on the simplex 
boundary. This is achieved by taking 

$$\bar{a}^{\max}_{H,n-1} = \bar{a}_n + \tau^{\max}_{n-1} \delta^{out}, \qquad \bar{a}^{\max}_{H,n} = \bar{a}_n + \tau^{\max}_n \delta^{out},$$

where, with the minimum and maximum taken over the indices with $\delta^{out}_i \neq 0$, 

$$\tau^{\max}_{n-1} = \min_{i \in \mathcal{X} \setminus \{n-1,n\}} \frac{\tfrac{1}{2}\big(1 + \operatorname{sign}(\delta^{out}_i)\big) - (\bar{a}_n)_i}{\delta^{out}_i} \ge 0,$$

$$\tau^{\max}_n = \max_{i \in \mathcal{X} \setminus \{n-1,n\}} \frac{\tfrac{1}{2}\big(1 - \operatorname{sign}(\delta^{out}_i)\big) - (\bar{a}_n)_i}{\delta^{out}_i} \le 0$$

(see Fig.7, bottom). Since we assumed as a convention that $\tau_{n-1} > \tau_n$ we have that 
any $\tau_{n-1} \le \tau^{\max}_{n-1}$ and $\tau_n \ge \tau^{\max}_n$ results in a non-negative matrix $BA_H$. Note that 
$BA_H \ge 0$ implies $A_{H[1:n-2,\,1:n]} \ge 0$. 

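Next, consider the new relative probabilities $\alpha_{H,i}$ as defined by (4) with $A_H$ 
replacing $A$. One can verify that these satisfy 

$$\alpha_{H,i} = \frac{\alpha_i - \tau_n}{\tau_{n-1} - \tau_n}, \qquad \forall i \in \operatorname{supp}_{in} \setminus \{n-1, n\}.$$

Obviously, a necessary condition for $A_H$ to be a valid transition matrix is that 

$$0 \le \alpha_{H,i} = \frac{\alpha_i - \tau_n}{\tau_{n-1} - \tau_n} \le 1, \qquad \forall i \in \operatorname{supp}_{in} \setminus \{n-1, n\}. \tag{40}$$

Define the minimal and maximal relative probabilities of the non-aliased states by 

$$\alpha^{\min} = \min\{\alpha_i \mid i \in \operatorname{supp}_{in} \setminus \{n-1, n\}\},$$

$$\alpha^{\max} = \max\{\alpha_i \mid i \in \operatorname{supp}_{in} \setminus \{n-1, n\}\}.$$

Let $\alpha^{\min}_H$ and $\alpha^{\max}_H$ be defined similarly. Taking 

$$\tau^{\min}_{n-1} = \alpha^{\max}, \qquad \tau^{\min}_n = \alpha^{\min},$$

we have $\alpha^{\min}_H = 0$ and $\alpha^{\max}_H = 1$. Hence, for any $\tau_{n-1} \ge \tau^{\min}_{n-1}$ and $\tau_n \le \tau^{\min}_n$ the 
constraint (40) holds and consequently $A_{H[1:n,\,1:n-2]}$ is non-negative. The corresponding 
columns $\bar{a}^{\min}_{H,n-1} = \bar{a}_n + \tau^{\min}_{n-1} \delta^{out}$ and $\bar{a}^{\min}_{H,n} = \bar{a}_n + \tau^{\min}_n \delta^{out}$ are depicted in Fig.7 
(bottom). 

Combining the above constraints we have that the four parameters $\tau^{\min}_{n-1}, \tau^{\min}_n, \tau^{\max}_{n-1}, \tau^{\max}_n$ 
define the rectangle 

$$\Gamma_1 = [\tau^{\min}_{n-1}, \tau^{\max}_{n-1}] \times [\tau^{\max}_n, \tau^{\min}_n], \tag{41}$$

which characterizes the equivalent matrices $A_H$ having all entries non-negative except 
possibly in the $2 \times 2$ aliased block (see Fig.6). Thus we must have $\Gamma_H \subset \Gamma_1$. 

We are left to find the conditions under which the $2 \times 2$ aliased block is 
non-negative. Writing $A_H$ explicitly we have that these conditions are 

$$(\tau_{n-1} - \tau_n) A_{H,n,n-1} = \tau_{n-1}(\tau_{n-1} - \alpha_{n-1}) P(\bar{n} \mid n-1) + (1 - \tau_{n-1})(\tau_{n-1} - \alpha_n) P(\bar{n} \mid n) \ge 0 \tag{42}$$

$$(\tau_{n-1} - \tau_n) A_{H,n-1,n} = \tau_n(\alpha_{n-1} - \tau_n) P(\bar{n} \mid n-1) + (1 - \tau_n)(\alpha_n - \tau_n) P(\bar{n} \mid n) \ge 0 \tag{43}$$

$$(\tau_{n-1} - \tau_n) A_{H,n-1,n-1} = \tau_{n-1}(\alpha_{n-1} - \tau_n) P(\bar{n} \mid n-1) + (1 - \tau_{n-1})(\alpha_n - \tau_n) P(\bar{n} \mid n) \ge 0 \tag{44}$$

$$(\tau_{n-1} - \tau_n) A_{H,n,n} = \tau_n(\tau_{n-1} - \alpha_{n-1}) P(\bar{n} \mid n-1) + (1 - \tau_n)(\tau_{n-1} - \alpha_n) P(\bar{n} \mid n) \ge 0. \tag{45}$$

Since $\tau_{n-1} > \tau_n$ by convention, the factor $\tau_{n-1} - \tau_n$ is positive and does not affect 
the sign of these entries. 

As the case $P(\bar{n} \mid n-1) = P(\bar{n} \mid n) = 0$ is trivial, we assume that at least one of 
$P(\bar{n} \mid n-1)$, $P(\bar{n} \mid n)$ is nonzero (and since by convention $P(\bar{n} \mid n-1) \ge P(\bar{n} \mid n)$, 
this is equivalent to $P(\bar{n} \mid n-1) > 0$). 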
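
The four block conditions can be cross-checked against a direct computation of $S^{-1} A S$. The sketch below assumes that $S(\tau_{n-1}, \tau_n)$ is the identity except for the $2 \times 2$ aliased block $\begin{pmatrix} \tau_{n-1} & \tau_n \\ 1-\tau_{n-1} & 1-\tau_n \end{pmatrix}$, and that $\alpha_i = A_{n-1,i} / (A_{n-1,i} + A_{n,i})$; both are assumptions about definitions given in the main text. The test point $(1.2, -0.1)$ is arbitrary.

```python
import numpy as np

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])

def S_matrix(t1, t2, n):
    """Assumed form of S(tau_{n-1}, tau_n): identity with a 2x2 aliased block."""
    S = np.eye(n)
    S[n - 2:, n - 2:] = [[t1, t2], [1.0 - t1, 1.0 - t2]]
    return S

def A_H(A, t1, t2):
    S = S_matrix(t1, t2, A.shape[0])
    return np.linalg.solve(S, A @ S)          # S^{-1} A S

def block_formulas(A, t1, t2):
    """Right-hand sides of (42)-(45), arranged like the aliased block of A_H."""
    n = A.shape[0]
    P1 = A[n - 2, n - 2] + A[n - 1, n - 2]    # P(n_bar | n-1)
    P2 = A[n - 2, n - 1] + A[n - 1, n - 1]    # P(n_bar | n)
    a1 = A[n - 2, n - 2] / P1                 # alpha_{n-1}
    a2 = A[n - 2, n - 1] / P2                 # alpha_n
    e42 = t1 * (t1 - a1) * P1 + (1 - t1) * (t1 - a2) * P2   # entry (n, n-1)
    e43 = t2 * (a1 - t2) * P1 + (1 - t2) * (a2 - t2) * P2   # entry (n-1, n)
    e44 = t1 * (a1 - t2) * P1 + (1 - t1) * (a2 - t2) * P2   # entry (n-1, n-1)
    e45 = t2 * (t1 - a1) * P1 + (1 - t2) * (t1 - a2) * P2   # entry (n, n)
    return np.array([[e44, e43], [e42, e45]])

t1, t2 = 1.2, -0.1
AH = A_H(A, t1, t2)
print(np.round(AH, 4))
```

At $(\tau_{n-1}, \tau_n) = (1, 0)$ the matrix $S$ is the identity, so `A_H(A, 1.0, 0.0)` returns the original $A$; at any other point the columns of $A_H$ still sum to one, but some entries may become negative.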

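Recall that by definition $\delta^{out}_{n-1} = P(\bar{n} \mid n-1) - P(\bar{n} \mid n)$ (see (5)). We now consider 
the cases $\delta^{out}_{n-1} = 0$ and $\delta^{out}_{n-1} > 0$ separately. 

The case $\delta^{out}_{n-1} = 0$. Consider first the off-diagonal constraint (43) for $A_{H,n-1,n} \ge 0$, 
taking the form 

$$\tau_n \big(1 - (\alpha_{n-1} - \alpha_n)\big) \le \alpha_n.$$

Denote 

$$\tau^0 = \alpha_n / \big(1 - (\alpha_{n-1} - \alpha_n)\big).$$

Since $\alpha_{n-1} - \alpha_n < 1$ we need $\tau_n \le \tau^0$. Similarly, (42) is satisfied if and only if 
$\tau_{n-1} \ge \tau^0$. Thus in order for the off-diagonal entries $A_{H,n,n-1}$, $A_{H,n-1,n}$ to be 
non-negative we need $(\tau_{n-1}, \tau_n) \in \Gamma^0_2$ where 

$$\Gamma^0_2 = [\tau^0, \infty] \times [-\infty, \tau^0]. \tag{46}$$

Next, the on-diagonal constraint (44) for $A_{H,n-1,n-1} \ge 0$ is equivalent to 

$$\tau_n \le \alpha_n + \tau_{n-1}(\alpha_{n-1} - \alpha_n). \tag{47}$$

Similarly, the on-diagonal constraint (45) for $A_{H,n,n} \ge 0$ is 

$$\tau_n(\alpha_{n-1} - \alpha_n) \le \tau_{n-1} - \alpha_n. \tag{48}$$

Define the two linear functions $g^0, f^0 : \mathbb{R} \to \mathbb{R}$ by 

$$g^0(\tau_{n-1}) = \alpha_n + (\alpha_{n-1} - \alpha_n)\tau_{n-1}, \qquad f^0(\tau_{n-1}) = \frac{\tau_{n-1} - \alpha_n}{\alpha_{n-1} - \alpha_n}.$$

Note that $\tau^0$ is a fixed point of both $g^0$ and $f^0$, 

$$\tau^0 = g^0(\tau^0) = f^0(\tau^0).$$

Note also that for $\alpha_{n-1} > \alpha_n$ the functions $g^0$ and $f^0$ are increasing, while for $\alpha_{n-1} < 
\alpha_n$ they are decreasing. Thus, if $\alpha_{n-1} \ge \alpha_n$ the constraints (47,48) are automatically 
satisfied for $(\tau_{n-1}, \tau_n) \in \Gamma^0_2$, so in this case $A_{H,n-1,n-1}$, $A_{H,n,n}$ are also guaranteed to 
be non-negative. 

If $\alpha_{n-1} < \alpha_n$ then with $\tau_{n-1} > \tau_n$ (as we assume here) we have $f^0(\tau_{n-1}) \le 
g^0(\tau_{n-1}) \le \tau^0$ and the constraints (47,48) take the form $f^0(\tau_{n-1}) \le \tau_n \le g^0(\tau_{n-1})$. 
Thus, in order for the on-diagonal entries $A_{H,n-1,n-1}$ and $A_{H,n,n}$ to be non-negative 
we must have $(\tau_{n-1}, \tau_n) \in \Gamma^0_3$, where 

$$\Gamma^0_3 = \{(\tau_{n-1}, \tau_n) \in \Gamma_1 \mid f^0(\tau_{n-1}) \le \tau_n \le g^0(\tau_{n-1})\}. \tag{49}$$

We are left to ensure that for $\alpha_{n-1} < \alpha_n$ the off-diagonal entries are also 
non-negative. Indeed, since $\tau_n < \tau_{n-1}$, $\tau^0$ is a fixed point and $f^0, g^0$ are 
decreasing, for any $(\tau_{n-1}, \tau_n) \in \Gamma^0_3$ we automatically have that $\tau^0 \le \tau_{n-1}$ and $\tau_n \le \tau^0$, 
so $(\tau_{n-1}, \tau_n) \in \Gamma^0_3$ implies $(\tau_{n-1}, \tau_n) \in \Gamma^0_2$. Thus all entries of the aliasing block are 
guaranteed to be non-negative. 

To conclude, we have shown that for $\delta^{out}_{n-1} = 0$ the feasible region (10) is given by 

$$\Gamma_H = \begin{cases} \Gamma_1 \cap \Gamma^0_2 & \alpha_{n-1} \ge \alpha_n \\ \Gamma_1 \cap \Gamma^0_3 & \alpha_{n-1} < \alpha_n. \end{cases}$$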

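The case $\delta^{out}_{n-1} > 0$. This case has the same characteristics as the $\delta^{out}_{n-1} = 0$ case, 
but it is a bit more complex to analyze. Define $\tau^{\pm}$ (as the analogues of $\tau^0$) by 

$$\tau^{\pm} = \frac{\alpha_{n-1} P(\bar{n} \mid n-1) - (1 + \alpha_n) P(\bar{n} \mid n) \pm \sqrt{\Delta}}{2 \delta^{out}_{n-1}}, \tag{50}$$

where 

$$\Delta = \big(\alpha_{n-1} P(\bar{n} \mid n-1) - (1 + \alpha_n) P(\bar{n} \mid n)\big)^2 + 4 \alpha_n P(\bar{n} \mid n) \delta^{out}_{n-1} \ge 0.$$

And define the regions 

$$\Gamma_2 = [\tau^+, \infty] \times [\tau^-, \tau^+] \tag{51}$$

$$\Gamma_3 = \{(\tau_{n-1}, \tau_n) \in \Gamma_1 \mid f(\tau_{n-1}) \le \tau_n \le g(\tau_{n-1})\} \tag{52}$$

where the functions $g, f : \mathbb{R} \to \mathbb{R}$ are given by 

$$g(\tau_{n-1}) = \frac{\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)}{\delta^{out}_{n-1}} - \frac{P(\bar{n} \mid n-1) P(\bar{n} \mid n)(\alpha_{n-1} - \alpha_n)}{(\delta^{out}_{n-1})^2} \times \left(\tau_{n-1} - \frac{-P(\bar{n} \mid n)}{\delta^{out}_{n-1}}\right)^{-1} \tag{53}$$

and 

$$f(\tau_{n-1}) = \frac{-P(\bar{n} \mid n)}{\delta^{out}_{n-1}} - \frac{P(\bar{n} \mid n-1) P(\bar{n} \mid n)(\alpha_{n-1} - \alpha_n)}{(\delta^{out}_{n-1})^2} \times \left(\tau_{n-1} - \frac{\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)}{\delta^{out}_{n-1}}\right)^{-1}. \tag{54}$$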
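
The boundary curves and the roots $\tau^{\pm}$ can be evaluated for the example of Section 6. The sketch below writes $g$ and $f$ in the equivalent rational form obtained by solving (44) and (45) with equality for $\tau_n$, and it assumes $\alpha_i = A_{n-1,i} / (A_{n-1,i} + A_{n,i})$, which for the example gives $\alpha_3 = 0.125$ and $\alpha_4 = 0.5$.

```python
import math

# Example values from the 4-state HMM of Section 6 (aliased states 3 and 4).
P1, P2 = 0.8, 0.2          # P(n_bar | n-1), P(n_bar | n)
a1, a2 = 0.125, 0.5        # alpha_{n-1}, alpha_n (assumed definition, see lead-in)
d = P1 - P2                # delta^out_{n-1} > 0
c = a1 * P1 - a2 * P2      # alpha_{n-1} P(n_bar | n-1) - alpha_n P(n_bar | n)

def g(t):
    """Boundary of (44): the value of tau_n at which A_{H,n-1,n-1} vanishes."""
    return (a2 * P2 + t * c) / (P2 + t * d)

def f(t):
    """Boundary of (45): the value of tau_n at which A_{H,n,n} vanishes."""
    return P2 * (a2 - t) / (t * d - c)

# Roots of delta * tau^2 - (c - P2) * tau - alpha_n * P2 = 0.
disc = (c - P2) ** 2 + 4.0 * a2 * P2 * d
tau_plus = ((c - P2) + math.sqrt(disc)) / (2.0 * d)
tau_minus = ((c - P2) - math.sqrt(disc)) / (2.0 * d)
print(tau_minus, tau_plus, f(1.0), g(1.0))
```

Both roots are fixed points of $g$ and of $f$, and the original pair $(\tau_{n-1}, \tau_n) = (1, 0)$ satisfies $f(1) \le 0 \le g(1)$, as it must for the original matrix to be feasible.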


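Lemma 8. Let $\Gamma_1$ be defined in (41) and let $\Gamma_2$ and $\Gamma_3$ be defined according to whether 
$\delta^{out}_{n-1} = 0$ (46,49) or not (51,52). Then the feasible region $\Gamma_H$ satisfies 

$$\Gamma_H = \begin{cases} \Gamma_1 \cap \Gamma_2 & \alpha_{n-1} \ge \alpha_n \\ \Gamma_1 \cap \Gamma_3 & \alpha_{n-1} < \alpha_n. \end{cases}$$

Proof. As the case $\delta^{out}_{n-1} = 0$ was treated above, we consider the case $\delta^{out}_{n-1} > 0$. 
Consider first the off-diagonal constraint (43). Multiplying by $(-1)$, we need to solve the 
following inequality for $\tau \in \mathbb{R}$, 

$$\delta^{out}_{n-1} \tau^2 - \tau\big(\alpha_{n-1} P(\bar{n} \mid n-1) - (1 + \alpha_n) P(\bar{n} \mid n)\big) - \alpha_n P(\bar{n} \mid n) \le 0. \tag{55}$$

We first solve with equality to find the solutions $\tau^-, \tau^+$ given in (50). Thus, since 
$\Delta \ge 0$ we have that any feasible $\tau_n$ must satisfy $\tau^- \le \tau_n \le \tau^+$. Note that the 
constraint (42) for $\tau_{n-1}$ is the complement of (43), and by assumption $\tau_{n-1} > \tau_n$, so 
(42) is satisfied iff $\tau^+ \le \tau_{n-1}$. Thus, the region $\Gamma_2$ given in (51) indeed characterizes 
the non-negativity of both $A_{H,n-1,n}$ and $A_{H,n,n-1}$. With some algebra, $\tau^+$ and $\tau^-$ can be 
shown to satisfy the following useful relations: 

• If $\alpha_{n-1} \ge \alpha_n$ then 

$$-\frac{P(\bar{n} \mid n)}{\delta^{out}_{n-1}} \le \tau^- \le 0 \tag{56}$$

and 

$$\tau^+ \le \frac{\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)}{\delta^{out}_{n-1}}. \tag{57}$$

• If $\alpha_{n-1} < \alpha_n$ then 

$$\tau^- \le -\frac{P(\bar{n} \mid n)}{\delta^{out}_{n-1}} \le \frac{\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)}{\delta^{out}_{n-1}} \le \tau^+. \tag{58}$$

We proceed to handle the constraints (44) and (45) corresponding to the region $\Gamma_3$. 
We begin by solving the inequality (44): 

$$-\tau_n\big(P(\bar{n} \mid n) + \tau_{n-1} \delta^{out}_{n-1}\big) + \alpha_n P(\bar{n} \mid n) + \tau_{n-1}\big(\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)\big) \ge 0.$$

Note that for $(\tau_{n-1}, \tau_n) \in \Gamma_1$ we have $\big(P(\bar{n} \mid n) + \tau_{n-1} \delta^{out}_{n-1}\big) \ge 0$. Rearranging we get 
that in order for $A_{H,n-1,n-1}$ to be non-negative we must have that 

$$\text{if } \tau_{n-1} > -\frac{P(\bar{n} \mid n)}{\delta^{out}_{n-1}} \text{ then } \tau_n \le g(\tau_{n-1}),$$

where $g$ is the function given in (53). Similarly, consider the condition (45), 

$$\tau_n\big(\alpha_n P(\bar{n} \mid n) - \alpha_{n-1} P(\bar{n} \mid n-1) + \tau_{n-1} \delta^{out}_{n-1}\big) + P(\bar{n} \mid n)(\tau_{n-1} - \alpha_n) \ge 0.$$

Rearranging we find that in order for $A_{H,n,n} \ge 0$ we must have 

$$\tau_n \le f(\tau_{n-1}) \quad \text{if } \tau_{n-1} < \frac{\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)}{P(\bar{n} \mid n-1) - P(\bar{n} \mid n)}, \tag{59}$$

$$\tau_n \ge f(\tau_{n-1}) \quad \text{otherwise}, \tag{60}$$

where the function $f$ is given in (54). Note that $g$ (resp. $f$) defines the boundary where 
(44) (resp. (45)) changes sign, namely any pair $(\tau_{n-1}, \tau_n) = (\tau_{n-1}, g(\tau_{n-1}))$ is on the 
curve making Equation (44) equal zero, and similarly $f(\tau_{n-1})$ is such that $(\tau_{n-1}, \tau_n) = 
(\tau_{n-1}, f(\tau_{n-1}))$ is on the curve making (45) equal zero. Having the boundaries $g, f$ at 
our disposal let us first consider the case $\alpha_{n-1} \ge \alpha_n$. 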


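The sub-case $\alpha_{n-1} \ge \alpha_n$. We show that in this case, having $(\tau_{n-1}, \tau_n) \in \Gamma_1 \cap \Gamma_2$ 
already ensures that conditions (59) and (60) are trivially met, which in turn implies 
the non-negativity of both $A_{H,n-1,n-1}$ and $A_{H,n,n}$. This is done by showing that for 
any $(\tau_{n-1}, \tau_n) \in \Gamma_1 \cap \Gamma_2$ the curve $(\tau_{n-1}, g(\tau_{n-1}))$ is above $(\tau_{n-1}, \tau^+)$. Similarly, 
for $\tau_{n-1} < (\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)) / \delta^{out}_{n-1}$ the curve $(\tau_{n-1}, f(\tau_{n-1}))$ is above 
$(\tau_{n-1}, \tau^+)$ and for $\tau_{n-1} > (\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)) / \delta^{out}_{n-1}$ the curve $(\tau_{n-1}, f(\tau_{n-1}))$ 
is below $(\tau_{n-1}, \tau^-)$, thus making conditions (59) and (60) true. Toward this end 
consider the equality $g(\tau) = f(\tau)$ given by 

$$0 = \big(\alpha_{n-1} P(\bar{n} \mid n-1) + (1 - \alpha_n) P(\bar{n} \mid n)\big) \times \Big(\delta^{out}_{n-1} \tau^2 + \tau\big((1 + \alpha_n) P(\bar{n} \mid n) - \alpha_{n-1} P(\bar{n} \mid n-1)\big) - \alpha_n P(\bar{n} \mid n)\Big).$$

Thus if $\big(\alpha_{n-1} P(\bar{n} \mid n-1) + (1 - \alpha_n) P(\bar{n} \mid n)\big) = 0$ we have that $g = f$ identically. 
Otherwise we need to solve again (55) so the solutions are $\tau^{\pm}$ with $g(\tau^+) = f(\tau^+)$ 
and $g(\tau^-) = f(\tau^-)$. In addition one can show that $\tau^+$ and $\tau^-$ are in fact fixed points 
of both $g$ and $f$, so together we have 

$$\tau^+ = g(\tau^+) = f(\tau^+),$$

$$\tau^- = g(\tau^-) = f(\tau^-).$$

Inspecting $g$ one can see that for $\tau_{n-1} > -P(\bar{n} \mid n) / \delta^{out}_{n-1}$, $g(\tau_{n-1})$ is monotonic 
increasing and concave. Since by (57) we have $\tau^+ > -P(\bar{n} \mid n) / \delta^{out}_{n-1}$ we get 
that for $\tau_{n-1} \ge \tau^+$ we must have $g(\tau_{n-1}) \ge \tau^+$ as needed. Similarly, for $\tau^+ \le 
\tau_{n-1} < (\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)) / \delta^{out}_{n-1}$ the function $f$ is increasing and 
convex and thus above $\tau^+$, while for $(\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)) / \delta^{out}_{n-1} < \tau_{n-1}$ it 
is increasing but always below $\tau^-$. Thus, for $\alpha_{n-1} \ge \alpha_n$ we have that $\Gamma_1 \cap \Gamma_2$ also 
characterizes the non-negativity of $A_{H,n-1,n-1}$ and $A_{H,n,n}$ as claimed. 

Figure 8: The effective feasible region for various constraints, ensuring $A_H(1 + 
\Delta\tau_{n-1}, \Delta\tau_n) \ge 0$ for $|\Delta\tau_{n-1}|, |\Delta\tau_n| \ll 1$. Panels: no constraints; $\Delta\tau_n \ge 0$; 
$\Delta\tau_{n-1}, \Delta\tau_n = 0$. 

The sub-case $\alpha_{n-1} < \alpha_n$. Note that by (58) we have $\tau^+ \ge (\alpha_{n-1} P(\bar{n} \mid n-1) - \alpha_n P(\bar{n} \mid n)) / \delta^{out}_{n-1}$. 
Thus for $\tau^+ \le \tau_{n-1}$ both $g$ and $f$ are decreasing and convex and $f(\tau_{n-1}) \le g(\tau_{n-1})$. 
Thus in order to ensure (59, 60) we need $f(\tau_{n-1}) \le \tau_n \le g(\tau_{n-1})$. Thus, $\Gamma_3$ as defined 
in (52) characterizes the non-negativity of $A_{H,n,n}$ and $A_{H,n-1,n-1}$. Finally we need to show 
that having $(\tau_{n-1}, \tau_n) \in \Gamma_3$ also ensures the non-negativity of $A_{H,n,n-1}$ and $A_{H,n-1,n}$. 
But by (58) we have that for $\tau_{n-1} \ge \tau^+$ both $f(\tau_{n-1}) \ge \tau^-$ and $g(\tau_{n-1}) \le \tau^+$, and thus $\Gamma_3 \subset \Gamma_2$. 
Hence we have shown that $\Gamma_H$ is characterized as claimed. □ 

Lemma 9. The set $\Gamma_H$ is connected. 

Proof. If $\alpha_{n-1} \ge \alpha_n$ then $\Gamma_H = \Gamma_1 \cap \Gamma_2$ is a rectangle and thus connected. In the case 
$\alpha_{n-1} < \alpha_n$ we have that $f(\tau_{n-1}) \le g(\tau_{n-1})$ and both are decreasing and convex, thus 
the intersection of the region $\Gamma_3$ with a rectangle is a connected set. □ 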


C.2 Conditions for $|\Gamma_H| = 1$ 

Let us first write 

$$(\tau_{n-1}, \tau_n) = (1, 0) + (\Delta\tau_{n-1}, \Delta\tau_n).$$

Table 1: For any state $i \in \mathcal{X} \setminus \{n-1, n\}$ with $A_{i,n} \in \{0, 1\}$ or $A_{i,n-1} \in \{0, 1\}$, pick 
the relevant diagram. [Diagram grid; column labels $A_{i,n-1} = 0$, $0 < A_{i,n-1} < 1$, $A_{i,n-1} = 1$; 
row labels include $A_{i,n} = 0$ and $A_{i,n} = 1$; one cell is marked "not aliased".] 

Table 2: For any state $j \in \operatorname{supp}_{in} \setminus \{n-1, n\}$ with $\alpha_j \in \{0, 1\}$ pick the corresponding 
diagram. [Two diagrams, labeled $\alpha_j = 0$ and $\alpha_j = 1$.] 

Table 3: If $\alpha_{n-1} \ge \alpha_n$ pick the relevant diagram from here. [Diagram grid indexed by 
relations among the entries of the $2 \times 2$ aliased block.] 

Table 4: If $\alpha_{n-1} < \alpha_n$ pick the relevant diagram from here. [Diagram grid; column labels 
$A_{n-1,n-1} = 0$, $A_{n-1,n-1} \neq 0$; row labels $A_{n,n} \neq 0$, $A_{n,n} = 0$.] 


We characterize the conditions for $|\Gamma_H| = 1$ by determining the geometrical constraints 
the entries of the transition matrix $A$ pose on $(\Delta\tau_{n-1}, \Delta\tau_n)$ in order to ensure 
$A_H(1 + \Delta\tau_{n-1}, \Delta\tau_n) \ge 0$. Note that $|\Gamma_H| = 1$ iff these constraints imply that 
$(\Delta\tau_{n-1}, \Delta\tau_n) = (0, 0)$ is the unique feasible pair. 

As a first example, consider a 2A-HMM $H$ having a transition matrix with all 
entries being strictly positive, $A \ge \epsilon > 0$. Since the mapping (9) is continuous in 
$\Delta\tau_{n-1}, \Delta\tau_n$, there exists a neighborhood $N \subset \mathbb{R}^2$ of $(\Delta\tau_{n-1}, \Delta\tau_n) = (0, 0)$, such that 
for any $(\Delta\tau_{n-1}, \Delta\tau_n) \in N$ the matrix $A_H(1 + \Delta\tau_{n-1}, \Delta\tau_n)$ is non-negative, and thus 
$(1, 0) + N \subset \Gamma_H$. This condition can be represented in the $(\Delta\tau_{n-1}, \Delta\tau_n)$ plane as the 
"full" diagram in Fig.8. On the other hand, the condition that $(\Delta\tau_{n-1}, \Delta\tau_n) = (0, 0)$ is 
the unique feasible pair can be represented by a point-like diagram as in Fig.8. 

In general, the entries of the transition matrix $A$ put constraints on the feasible 
$(\Delta\tau_{n-1}, \Delta\tau_n)$ only when $(\tau_{n-1}, \tau_n) = (1, 0)$ is on the boundary of $\Gamma_H$. These 
constraints can be explicitly determined in terms of $A$'s entries by considering the exact 
characterization of $\Gamma_H$ given in Theorem 2. Note however that by the fact that $\Gamma_H$ is 
connected, and as far as the condition $|\Gamma_H| = 1$ is concerned, we only need to consider 
the shape of these constraints in a small neighborhood of $(\tau_{n-1}, \tau_n) = (1, 0)$, i.e. for 
$|\Delta\tau_{n-1}|, |\Delta\tau_n| \ll 1$. Any such neighborhood can be represented on the plane (as 
in Fig.8). The shape of this neighborhood for a given $H$ is called the effective feasible 
region of $\Gamma_H$. 

Now, as the example with $A \ge \epsilon > 0$ shows, a non-trivial constraint on the 
(effective) feasible region must result from $A$ having some zero entries. Each such zero 
entry, as determined by its position in $A$, puts a boundary constraint on $(\Delta\tau_{n-1}, \Delta\tau_n)$. 
These in turn correspond to a suitable diagram in the $(\Delta\tau_{n-1}, \Delta\tau_n)$ plane (as the diagram 
for $\Delta\tau_n \ge 0$ in Fig.8). The effective feasible region of $A$ is obtained by taking the 
intersection of all these diagrams. The exact correspondence between $A$'s entries and the 
corresponding diagrams is given in Tables 1,2,3,4. The procedure for determining the 
effective feasible region of a 2A-HMM is given in Algorithm 1. The correctness of the 
algorithm is demonstrated in the proof of Lemma 8. 


Algorithm 1 Determining the effective feasible region for a minimal 2A-HMM $H$ 
1: permute aliased states so that $P(\bar{n} \mid n-1) \ge P(\bar{n} \mid n)$ 
2: collect the following diagrams: 
   - $\forall i \in [n-2]$ with $A_{i,n} \in \{0, 1\}$ or $A_{i,n-1} \in \{0, 1\}$ pick the relevant diagram 
     in Table 1 
   - $\forall j \in \operatorname{supp}_{in} \setminus \{n-1, n\}$ with $\alpha_j \in \{0, 1\}$ pick the corresponding diagram in 
     Table 2 
   - if $\alpha_{n-1} \ge \alpha_n$ pick the relevant diagram in Table 3 and if $\alpha_{n-1} < \alpha_n$ pick the relevant 
     diagram in Table 4 
3: return the intersection of all the regions obtained in the previous step 


C.3 Examples. 

We demonstrate our Algorithm 1 for determining the identifiability of 2A-HMMs on 
the 2A-HMM given in Section 6, shown in Fig.2 (left). Going through the steps of 
Algorithm 1 we get the following diagrams for the effective feasible region: 

from Table 1: $A_{1,3} = 0 \wedge A_{1,4} \neq 0$ 
from Table 1: $A_{2,4} = 0 \wedge A_{2,3} \neq 0$ 
from Table 2: $\alpha_1 = 1$ 
from Table 2: $\alpha_2 = 0$ 

Since their intersection results in a point-like diagram, this 2A-HMM is identifiable. 
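
The same conclusion can be checked by brute force, without the diagrams. The sketch below is not Algorithm 1; it simply scans a grid of $(\tau_{n-1}, \tau_n)$ around $(1, 0)$ and keeps the pairs for which $S^{-1} A S$ is entrywise non-negative. It assumes that $S(\tau_{n-1}, \tau_n)$ is the identity except for the aliased block $\begin{pmatrix} \tau_{n-1} & \tau_n \\ 1-\tau_{n-1} & 1-\tau_n \end{pmatrix}$; the grid step and the tolerance are arbitrary choices.

```python
import numpy as np

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]

def A_H(A, t1, t2):
    """S^{-1} A S for the assumed form of S(tau_{n-1}, tau_n)."""
    S = np.eye(n)
    S[n - 2:, n - 2:] = [[t1, t2], [1.0 - t1, 1.0 - t2]]
    return np.linalg.solve(S, A @ S)

tol = 1e-9
feasible = []
for k1 in range(-10, 11):
    for k2 in range(-10, 11):
        t1, t2 = 1.0 + 0.05 * k1, 0.05 * k2     # grid around (1, 0)
        if abs(t1 - t2) < 1e-12:                # S is singular when tau_{n-1} = tau_n
            continue
        if A_H(A, t1, t2).min() >= -tol:
            feasible.append((t1, t2))
print(feasible)
```

On this grid only the original pair $(1, 0)$ survives, which is consistent with the point-like diagram. A grid scan of this kind is of course only a sanity check and not a proof of identifiability.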

More generally, for a minimal stationary 2A-HMM satisfying Assumptions (A1-A2) 
with aliased states $n$ and $n-1$, a sufficient condition for uniqueness is the following 
constraints on the allowed transitions between the hidden states: $\exists\, i_{n-1}, j_{n-1}, i_n, j_n \in 
[n-2]$ such that 

✓ $i_{n-1} \to n-1 \to j_{n-1}$ 
✗ $i_{n-1} \to n \to *$ 
✓ $i_n \to n \to j_n$ 
✗ $i_n \to n-1 \to *$ 
✗ $* \to n-1 \to j_n$ 
✗ $* \to n \to j_{n-1}$. 

One can check that these conditions give the same set of diagrams as above. 


D Proofs for Section 5 (Learning) 

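D.1 Proof of Lemma 3 

The claim in (21) that $R^{(1)} = BAC_\beta = \bar{A}$ follows directly from (14) and (16). Next, 
we have 

$$R^{(2)} = BAAC_\beta - BAC_\beta BAC_\beta = BA(I_n - C_\beta B)AC_\beta = BA\, b c_\beta^\top AC_\beta,$$

where the last equality is by the fact that $(I_n - C_\beta B) = b c_\beta^\top$. Since $BAb = \delta^{out}$ and 
$c_\beta^\top A C_\beta = (\delta^{in})^\top$ we have $R^{(2)} = \delta^{out} (\delta^{in})^\top$ as claimed in (22). 

As for (23) we have, 

$$R^{(3)} = BAAAC_\beta - BAAC_\beta BAC_\beta - BAC_\beta BAAC_\beta + BAC_\beta BAC_\beta BAC_\beta = BA(I_n - C_\beta B)A(I_n - C_\beta B)AC_\beta.$$

Since by definition $\kappa = c_\beta^\top A b$ we have $R^{(3)} = \kappa R^{(2)}$ and the claim in (23) is proved. 
Finally, 

$$F^{(c)} = BA\big(\operatorname{diag}(\mathcal{K}_{[\cdot,c]}) - C_\beta \operatorname{diag}(K_{[\cdot,c]}) B\big)AC_\beta.$$

Since 

$$\operatorname{diag}(\mathcal{K}_{[\cdot,c]}) - C_\beta \operatorname{diag}(K_{[\cdot,c]}) B = K_{n-1,c}\, b c_\beta^\top$$

we have that $F^{(c)} = K_{n-1,c} R^{(2)}$ as claimed in (24). 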
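
The identities (22) and (23) are easy to confirm numerically on the example of Section 6. The sketch below assumes the forms of $B$, $C_\beta$, $b$ and $c_\beta$ used in this appendix and works with the population quantities $BA^tC_\beta$ directly, rather than with moments estimated from data; the names `G1`, `G2`, `G3` are illustrative.

```python
import numpy as np

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]

w, V = np.linalg.eig(A)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()
beta = pi[n - 2] / (pi[n - 2] + pi[n - 1])

B = np.zeros((n - 1, n)); B[:, :n - 1] = np.eye(n - 1); B[n - 2, n - 1] = 1.0
C = np.zeros((n, n - 1)); C[:n - 2, :n - 2] = np.eye(n - 2)
C[n - 2, n - 2], C[n - 1, n - 2] = beta, 1.0 - beta
b = np.zeros(n); b[n - 2], b[n - 1] = 1.0, -1.0
c = np.zeros(n); c[n - 2], c[n - 1] = 1.0 - beta, -beta

G1 = B @ A @ C               # B A C_beta
G2 = B @ A @ A @ C           # B A^2 C_beta
G3 = B @ A @ A @ A @ C       # B A^3 C_beta
R2 = G2 - G1 @ G1
R3 = G3 - G2 @ G1 - G1 @ G2 + G1 @ G1 @ G1

d_out = B @ A @ b            # delta^out
d_in = C.T @ A.T @ c         # delta^in
kappa = c @ A @ b
print(np.linalg.svd(R2, compute_uv=False))
```

The printed singular values show a single non-zero value, in line with the rank-one structure $R^{(2)} = \delta^{out} (\delta^{in})^\top$.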

D.2 Proof of Lemma 4 

Assumption (A2) combined with the fact that the HMM has a finite number of states 
imply that the HMM is geometrically ergodic: there exist parameters $G < \infty$ and 
$\psi \in [0, 1)$ such that from any initial distribution $\pi^0$, 

$$\|A^t \pi^0 - \pi\|_1 \le 2G\psi^t, \qquad \forall t \in \mathbb{N}. \tag{61}$$

Thus, we may apply the following concentration bound, given in Kontorovich & Weiss 
[2014]: 

Theorem 4. Let $Y = (Y_0, \ldots, Y_{T-1}) \in \mathcal{Y}^T$ be the output of a HMM with transition 
matrix $A$ and output parameters $\theta$. Assume that $A$ is geometrically ergodic with constants 
$G, \psi$. Let $F : (Y_0, \ldots, Y_{T-1}) \mapsto \mathbb{R}$ be any function that is Lipschitz with constant $l$ 
with respect to the Hamming metric on $\mathcal{Y}^T$. Then, for all $\epsilon > 0$, 

$$\Pr\big(|F(Y) - \mathbb{E}F| > \epsilon T\big) \le 2\exp\left(-\frac{T(1-\psi)^2 \epsilon^2}{2 l^2 G^2}\right). \tag{62}$$

In order to apply the theorem note that $\forall t \in \{1, 2, 3\}$, $\mathbb{E}[\hat{M}^{(t)}_{ij}] = M^{(t)}_{ij}$ for any 
$i, j \in [n-1]$. In addition, following Assumption (A3), $(T - t)\hat{M}^{(t)}_{ij}$ is $(t+1)L^2$-Lipschitz 
with respect to the Hamming metric on $\mathcal{Y}^T$. Thus, taking $\epsilon \approx T^{-1/2}$ in 
Theorem 4 and applying a union bound on $i, j$ readily gives 

$$\hat{M}^{(t)} = M^{(t)} + O_P\big(T^{-1/2}\big).$$

The kernel-free moments given in (25) incur additional error which results in a 
factor of at most $1/(\sigma_{\min}(K)^2 \min_i \bar{\pi}_i)$ hidden in the $O_P$ notation. Since $\hat{R}^{(t)}$ are (low 
order) polynomials of $\hat{M}^{(t)}$, the asymptotics $O_P(T^{-1/2})$ carry on to the error in $\hat{R}^{(t)}$. 
A similar argument yields the claim for $\hat{F}^{(c)}$. 

D.3 Proof of Lemma 6 

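Let $\hat{\sigma}_1$ and $\sigma_1$ be the largest singular values of $\hat{R}^{(2)}$ and $R^{(2)}$, respectively. Combining 
Weyl's Theorem [Stewart & Sun, 1990] with Lemma 4 gives 

$$|\hat{\sigma}_1 - \sigma_1| \le \|\hat{R}^{(2)} - R^{(2)}\|_F = O_P\big(T^{-1/2}\big).$$

Recall that under the null hypothesis $\mathcal{H}_0$, we have $\sigma_1 = 0$. Thus, with high 
probability $\hat{\sigma}_1 \le \xi_0 T^{-1/2}$, for some $\xi_0 > 0$. In contrast, under $\mathcal{H}_1$ we have $\sigma_1 = \sigma > 0$, thus 
for some $\xi_1 > 0$, $\hat{\sigma}_1 \ge \sigma - \xi_1 T^{-1/2}$. Hence, taking $T$ sufficiently large, we have that 
for any $c_h > 0$ and $0 < \epsilon < \tfrac{1}{2}$, with $h_T = c_h T^{-1/2+\epsilon}$, 

$$\text{in case } \mathcal{H}_0 : \hat{\sigma}_1 < h_T,$$

$$\text{in case } \mathcal{H}_1 : \hat{\sigma}_1 > h_T,$$

with high probability. Thus, aliasing is detected correctly with high probability. 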


D.4 Proof of Lemma 7 

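Let us define the following score function for any $i \in [n-1]$, 

$$\operatorname{score}(i) = \sum_{j \in [n-1]} \big\|\hat{F}^{(j)} - K_{j,i} \hat{R}^{(2)}\big\|_F^2.$$

According to Eq. (30) the chosen aliased component is the index with minimal score. 
Hence, in order to prove the Lemma we need to show that 

$$\lim_{T \to \infty} \Pr\big(\exists i \neq n-1 : \operatorname{score}(i) < \operatorname{score}(n-1)\big) = 0.$$

By Lemma 4 and (20) we have 

$$\hat{F}^{(j)} = F^{(j)} + \frac{\xi_F^{(j)}}{\sqrt{T}} = K_{j,n-1} R^{(2)} + \frac{\xi_F^{(j)}}{\sqrt{T}},$$

$$\hat{R}^{(2)} = R^{(2)} + \frac{\xi_R}{\sqrt{T}},$$

for some $\xi_F^{(j)}, \xi_R$ with $\xi_R = O_P(1)$ and $\xi_F^{(j)} = O_P(1)$. Thus, 

$$\operatorname{score}(n-1) = \sum_{j \in [n-1]} \frac{1}{T} \big\|\xi_F^{(j)} - K_{j,n-1} \xi_R\big\|_F^2 \xrightarrow{P} 0.$$

In contrast, for any $i \neq n-1$ we may write $\operatorname{score}(i)$ as 

$$\sum_{j \in [n-1]} \Big\|(K_{j,n-1} - K_{j,i}) R^{(2)} + \frac{1}{\sqrt{T}}\big(\xi_F^{(j)} - K_{j,i} \xi_R\big)\Big\|_F^2.$$

Applying the (inverse) triangle inequality we have 

$$\operatorname{score}(i) \ge \sigma^2 \big\|K_{[\cdot,n-1]} - K_{[\cdot,i]}\big\|_2^2 - O_P\big(T^{-1/2}\big).$$

Since $K$ is full rank, $\|K_{[\cdot,n-1]} - K_{[\cdot,i]}\|_2^2 > 0$. Thus, for any $i \neq n-1$ as $T \to \infty$, 
w.h.p $\operatorname{score}(i) > \operatorname{score}(n-1)$. Taking a union bound over $i$ yields the claim. 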
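
The selection rule can be sketched as follows. The exact form of the score is an assumption here (a sum of squared Frobenius residuals between $\hat{F}^{(j)}$ and $K_{j,i}\hat{R}^{(2)}$), and the kernel matrix and $R^{(2)}$ below are hypothetical noise-free numbers chosen for illustration, not values from the paper.

```python
import numpy as np

def aliased_component(F_hat, K, R2_hat):
    """Return the index i minimizing sum_j ||F_hat[j] - K[j, i] * R2_hat||_F^2, and all scores."""
    m = K.shape[0]
    scores = np.array([sum(np.linalg.norm(F_hat[j] - K[j, i] * R2_hat, 'fro') ** 2
                           for j in range(m))
                       for i in range(m)])
    return int(np.argmin(scores)), scores

# Hypothetical symmetric, full-rank kernel matrix and a rank-one R^(2).
K = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
R2 = 0.33 * np.outer([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

# Noise-free F^(j) = K[j, n-1] * R^(2), i.e. (24) with the last component aliased.
F = [K[j, 2] * R2 for j in range(3)]
idx, scores = aliased_component(F, K, R2)
print(idx, scores)
```

In the noise-free case the score of the true aliased component is exactly zero, while every other index has score $\sigma^2 \|K_{[\cdot,n-1]} - K_{[\cdot,i]}\|_2^2 > 0$, which mirrors the argument of the proof.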


D.5 Estimating $\gamma$ and $\beta$ 

We now show how to estimate $\gamma$ and $\beta$. As discussed in Section 5.4, this is done 
by searching for $\gamma', \beta'$ ensuring the non-negativity of (33), namely, $A'_H(\gamma', \beta') \ge 0$, 
where 

$$A'_H(\gamma', \beta') = C_{\beta'} \bar{A} B + \gamma' C_{\beta'} u c_{\beta'}^\top + \frac{\sigma}{\gamma'} b v^\top B + \kappa\, b c_{\beta'}^\top.$$

We pose this as a non-linear two dimensional optimization problem. For any $\gamma' \ge 0$ 
and $0 \le \beta' \le 1$ define the objective function $h : \mathbb{R}^2 \to \mathbb{R}$ by 

$$h(\gamma', \beta') = \min_{i,j \in [n]} \big\{\gamma' A'_H(\gamma', \beta')_{ij}\big\}.$$

Note that $h(\gamma', \beta') \ge 0$ iff $A'_H(\gamma', \beta')$ does not have negative entries. Recall that by 
the identifiability of $H$, if we constrain $\gamma' \ge 0$ then the constraint $A'_H(\gamma', \beta') \ge 0$ has 
the unique solution $(\gamma, \beta)$ (this is equivalent to the convention $\tau_{n-1} > \tau_n$ made in 
Section 4.2). Namely, any $(\gamma', \beta') \neq (\gamma, \beta)$ results in at least one negative entry in 
$A'_H(\gamma', \beta')$. Hence, $h(\gamma', \beta')$ has a unique maximum, obtained at the true $(\gamma, \beta)$. In 
addition, since $\|u\|_2 = \|v\|_2 = 1$, any feasible solution must have $\gamma' \le 2$. So our 
optimization problem is: 

$$(\hat{\gamma}, \hat{\beta}) = \operatorname*{argmax}_{(\gamma', \beta') \in [0,2] \times [0,1]} h(\gamma', \beta'). \tag{63}$$

This two dimensional optimization problem can be solved by either brute force or any 
non-linear problem solver. 

In practice, we solve the optimization problem (63) with the empirical estimates 
plugged in, that is 

$$\hat{A}'_H(\gamma', \beta') = C_{\beta'} \hat{\bar{A}} B + \gamma' C_{\beta'} \hat{u}_1 c_{\beta'}^\top + \frac{\hat{\sigma}_1}{\gamma'} b \hat{v}_1^\top B + \hat{\kappa}\, b c_{\beta'}^\top.$$

The empirical objective function $\hat{h}(\gamma', \beta')$ is defined similarly. Such a perturbation may 
result in a problem with many feasible solutions, or worse, with no feasible solutions 
at all. Nevertheless, as shown in the proof of Theorem 3, this method is consistent. 
Namely, as $T \to \infty$, the above method will return an arbitrarily close solution (in $\|\cdot\|_F$) 
to the true transition matrix $A$, with high probability. 
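
The objective can be sketched with exact (population) quantities for the example of Section 6. The sketch assumes $\delta^{out} = \gamma u$ and $\delta^{in} = (\sigma/\gamma) v$ with unit vectors $u, v$, and fixes their signs by normalizing $\delta^{out}$ and $\delta^{in}$ directly; singular vectors returned by an SVD routine would carry a sign ambiguity that has to be resolved separately. The helper names `lift`, `A_prime` and `h` are illustrative.

```python
import numpy as np

A = np.array([[0.3, 0.25, 0.0, 0.8],
              [0.6, 0.25, 0.2, 0.0],
              [0.0, 0.50, 0.1, 0.1],
              [0.1, 0.00, 0.7, 0.1]])
n = A.shape[0]

w, V = np.linalg.eig(A)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()
beta = pi[n - 2] / (pi[n - 2] + pi[n - 1])

B = np.zeros((n - 1, n)); B[:, :n - 1] = np.eye(n - 1); B[n - 2, n - 1] = 1.0
b = np.zeros(n); b[n - 2], b[n - 1] = 1.0, -1.0

def lift(beta_p):
    """Lifting matrix C_{beta'} and vector c_{beta'} for a candidate beta'."""
    C = np.zeros((n, n - 1)); C[:n - 2, :n - 2] = np.eye(n - 2)
    C[n - 2, n - 2], C[n - 1, n - 2] = beta_p, 1.0 - beta_p
    c = np.zeros(n); c[n - 2], c[n - 1] = 1.0 - beta_p, -beta_p
    return C, c

C, c = lift(beta)
A_bar = B @ A @ C
d_out, d_in, kappa = B @ A @ b, C.T @ A.T @ c, c @ A @ b
gamma = np.linalg.norm(d_out)
sigma = gamma * np.linalg.norm(d_in)
u, v = d_out / gamma, d_in / np.linalg.norm(d_in)

def A_prime(gamma_p, beta_p):
    Cp, cp = lift(beta_p)
    return (Cp @ A_bar @ B + gamma_p * np.outer(Cp @ u, cp)
            + (sigma / gamma_p) * np.outer(b, B.T @ v) + kappa * np.outer(b, cp))

def h(gamma_p, beta_p):
    return gamma_p * A_prime(gamma_p, beta_p).min()

print(h(gamma, beta), h(gamma + 0.1, beta))
```

At the true $(\gamma, \beta)$ the matrix $A'_H$ coincides with $A$ and $h$ is zero (the example has zero entries), while a perturbed $\gamma'$ already produces a negative entry. A brute-force solution of (63) would evaluate `h` on a grid over $[0,2] \times [0,1]$ and keep the maximizer.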

D.6 Proof of Theorem 3 

Recall the definitions of $A'_H(\gamma', \beta')$ and its empirical version $\hat{A}'_H(\gamma', \beta')$, given in the 
previous Section D.5. To prove the theorem we show that 

$$\big\|\hat{A}'_H(\hat{\gamma}, \hat{\beta}) - A'_H(\gamma, \beta)\big\|_F \xrightarrow{P} 0.$$

Toward this goal we bound the l.h.s by 

$$\big\|\hat{A}'_H(\hat{\gamma}, \hat{\beta}) - A'_H(\hat{\gamma}, \hat{\beta})\big\|_F + \big\|A'_H(\hat{\gamma}, \hat{\beta}) - A'_H(\gamma, \beta)\big\|_F, \tag{64}$$

and show that each term converges to 0 in probability. 

We shall need the following lemma, establishing the pointwise convergence in 
probability of $\hat{A}'_H$ to $A'_H$: 

Lemma 10. For any $0 < \gamma'$ and $0 \le \beta' \le 1$, 

$$\big\|\hat{A}'_H(\gamma', \beta') - A'_H(\gamma', \beta')\big\|_F = o_P(1).$$

Proof. By (29), $\hat{\bar{A}} \xrightarrow{P} \bar{A}$. In addition, in Section D.3 we saw $\hat{\sigma}_1 \xrightarrow{P} \sigma$ and one can 
easily show that $\hat{\kappa} \xrightarrow{P} \kappa$. Thus, in order to prove the claim it suffices to show that 
$\hat{u}_1 \xrightarrow{P} u$ and $\hat{v}_1 \xrightarrow{P} v$. By Wedin's Theorem [Stewart & Sun, 1990]: 

$$\|\hat{u}_1 - u\|_2 \le C \frac{\|\hat{R}^{(2)} - R^{(2)}\|_F}{\sigma}$$

for some $C > 0$. Combining this with Lemma 4 gives that $\|\hat{u}_1 - u\|_2 = O_P(T^{-1/2})$. 
The same argument goes for $\|\hat{v}_1 - v\|_2$. □ 

We begin with the second term in (64). The first step is showing that the estimated 
parameters $\hat{\gamma}, \hat{\beta}$ in (63) converge in probability to the true parameters $\gamma, \beta$. We 
first need the following lemma, establishing the convergence of $\hat{h}$ to $h$ uniformly in 
probability: 

Lemma 11. For any $\epsilon > 0$, 

$$\Pr\left(\sup_{(\gamma', \beta') \in [0,2] \times [0,1]} \big|\hat{h}(\gamma', \beta') - h(\gamma', \beta')\big| > \epsilon\right) = o(1).$$

Proof. Note that $h(\gamma', \beta')$ is the value of the minimal entry of a matrix with all entries 
being polynomials of $\gamma', \beta'$ with bounded coefficients. Thus $h$ is Lipschitz. In addition 
$[0,2] \times [0,1]$ is compact and, similarly to Lemma 10, $\hat{h}(\gamma', \beta')$ converges in probability 
pointwise to $h(\gamma', \beta')$. Hence, the claim follows by Newey [1991, Corollary 2.2]. □ 

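Lemma 12. $(\hat{\gamma}, \hat{\beta}) \xrightarrow{P} (\gamma, \beta)$. 

Proof. Recall that $(\hat{\gamma}, \hat{\beta})$ are the maximizers of $\hat{h}(\gamma', \beta')$ and $(\gamma, \beta)$ are the maximizers 
of $h(\gamma', \beta')$, over $(\gamma', \beta') \in [0,2] \times [0,1]$. To prove the claim we need to show that for 
any $\delta > 0$, 

$$\Pr\big(\|(\hat{\gamma}, \hat{\beta}) - (\gamma, \beta)\| > \delta\big) = o(1).$$

Toward this end define 

$$\epsilon(\delta) = h(\gamma, \beta) - \max_{\|(\gamma', \beta') - (\gamma, \beta)\| \ge \delta} h(\gamma', \beta').$$

Note that $\epsilon(\delta) > 0$ since $h(\gamma', \beta')$ has the unique maximum $(\gamma, \beta)$. 
Now, by Lemma 11, we have that 

$$\Pr\left(\sup_{\gamma', \beta'} \big|\hat{h}(\gamma', \beta') - h(\gamma', \beta')\big| > \epsilon(\delta)/4\right) = o(1). \tag{65}$$

Thus, if we show that $\sup |\hat{h} - h| < \epsilon(\delta)/4$ implies $\|(\hat{\gamma}, \hat{\beta}) - (\gamma, \beta)\| < \delta$ then the 
claim is proved. So assume 

$$\sup_{\gamma', \beta'} \big|\hat{h}(\gamma', \beta') - h(\gamma', \beta')\big| < \epsilon(\delta)/4.$$

Toward getting a contradiction let us assume that $\|(\hat{\gamma}, \hat{\beta}) - (\gamma, \beta)\| \ge \delta$. Then the 
following relations hold: 

$$h(\hat{\gamma}, \hat{\beta}) \le h(\gamma, \beta) - \epsilon(\delta),$$

$$\hat{h}(\hat{\gamma}, \hat{\beta}) \le h(\hat{\gamma}, \hat{\beta}) + \epsilon(\delta)/4,$$

$$\hat{h}(\gamma, \beta) \ge h(\gamma, \beta) - \epsilon(\delta)/4.$$

Thus, 

$$\hat{h}(\hat{\gamma}, \hat{\beta}) \le \hat{h}(\gamma, \beta) - \epsilon(\delta)/2,$$

in contradiction to the optimality of $(\hat{\gamma}, \hat{\beta})$. □ 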

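By Lemma 12, $(\hat{\gamma}, \hat{\beta}) \xrightarrow{P} (\gamma, \beta)$. Since $H$ is minimal, Theorem 1 implies $\gamma > 0$ 
and thus $\hat{\gamma} \ge \gamma/2$ with high probability. In addition, $A'_H$ is continuous in the compact set 
$[\gamma/2, 2] \times [0,1]$. Thus, by the continuous mapping theorem we have 

$$\big\|A'_H(\hat{\gamma}, \hat{\beta}) - A'_H(\gamma, \beta)\big\|_F \xrightarrow{P} 0.$$

This proves the case for the right term of (64). 

The convergence in probability of the left term of (64) to zero is a direct 
consequence of the following uniform convergence lemma: 

Lemma 13. 

$$\sup_{(\gamma', \beta') \in [\gamma/2, 2] \times [0,1]} \big\|\hat{A}'_H(\gamma', \beta') - A'_H(\gamma', \beta')\big\|_F = o_P(1).$$

Proof. Since $\gamma' \ge \gamma/2$ we have that for any $i, j \in [n]$, $A'_H(\gamma', \beta')_{ij}$ is Lipschitz. 
In addition, by Lemma 10, for any $(\gamma', \beta') \in [\gamma/2, 2] \times [0,1]$, each entry $\hat{A}'_H(\gamma', \beta')_{ij}$ 
converges pointwise in probability to $A'_H(\gamma', \beta')_{ij}$. Finally, $[\gamma/2, 2] \times [0,1]$ is compact. 
Thus, the claim follows from Newey [1991, Corollary 2.2] with an application of a 
union bound over $i, j \in [n]$. □ 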





